<h1 style="text-align:center;">Understanding Estimators in Scikit-learn</h1>
<hr>

#  Estimators

In **scikit-learn**, any object that can **learn from data** is called an **Estimator**.  
The key identity of an estimator is the presence of a **`.fit()`** method.  
If an object implements `.fit()`, it means it can learn patterns or parameters from data.

---

##  What is an Estimator?
- An **Estimator** is any object in scikit-learn that has a **`.fit()`** method.
- The `.fit()` method allows the estimator to **learn from training data**.
- Examples:
  - **Machine Learning algorithms** (e.g., `LinearRegression`, `DecisionTreeClassifier`)
  - **Transformers** such as encoders or scalers

> **Example**:  
> - `LinearRegression.fit()` learns the model weights.  
> - `DecisionTreeClassifier.fit()` learns the tree structure.
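
To make "learning" concrete, here is a minimal sketch that fits a `LinearRegression` and reads back what it stored. The toy data (generated from y = 2x + 1) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data that follows y = 2x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)  # fit() stores what was learned on the estimator itself

# Learned attributes end with an underscore by scikit-learn convention
print(model.coef_)       # slope, close to 2
print(model.intercept_)  # intercept, close to 1
```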

---

##  Types of Estimators

### 1. Predictors
- **Definition**: Estimators that **predict** outcomes on new/unseen data.
- **Identity**: Have a **`.predict()`** method.
- **Flow**: `fit()` → learn → `predict()` → generate predictions.
- **Examples**: All supervised ML algorithms like:
  - `LinearRegression`
  - `RandomForestClassifier`
  - `SVC`
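
The `fit()` → `predict()` flow can be sketched as follows. The dataset, split, and hyperparameters are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)     # learn from the training split
y_pred = clf.predict(X_test)  # one predicted label per unseen row

print(y_pred.shape)  # same number of rows as X_test
```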

---

### 2. Transformers
- **Definition**: Estimators that **transform** data after learning from it.
- **Identity**: Have a **`.transform()`** or **`.fit_transform()`** method.
- **Flow**: `fit()` → learn → `transform()` → apply transformation.
- **Examples**:
  - `StandardScaler`
  - `OneHotEncoder`
  - `OrdinalEncoder`
  - `PCA`
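
A minimal sketch of the `fit()` → `transform()` flow, using `StandardScaler` on a small made-up array (the numbers are assumptions chosen so the learned means are easy to check by eye):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X)                   # learns per-column mean and standard deviation
X_scaled = scaler.transform(X)  # applies (x - mean) / std using the stored values

print(scaler.mean_)           # per-column means learned in fit()
print(X_scaled.mean(axis=0))  # approximately 0 for every column
```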

---

##  Pure Estimators (Fit-Only Objects)
Some estimators only **learn from data** and offer neither `.predict()` nor `.transform()`.  
Their **`.fit()`** method returns the estimator itself; what was learned is exposed through attributes (e.g., cluster labels in `labels_`).

- **Example**: `AgglomerativeClustering`
  - Learns clusters from data
  - Stores the cluster of each training point in the `labels_` attribute
  - No `.predict()` or `.transform()` method (it does offer `.fit_predict()`, which fits and returns the labels in one call)
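
The sketch below checks this behavior on synthetic blobs. The data and the number of clusters are assumptions for illustration only.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

model = AgglomerativeClustering(n_clusters=3)
model.fit(X)

print(model.labels_[:10])           # cluster index of the first 10 training points
print(hasattr(model, "predict"))    # False
print(hasattr(model, "transform"))  # False
```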

---

###  Table

| Type            | Key Method(s)           | Purpose                          | Examples                    |
|------------------|--------------------------|-----------------------------------|------------------------------|
| **Predictor**    | `fit()`, `predict()`     | Learn patterns & predict outputs | `LinearRegression`, `SVC`   |
| **Transformer**  | `fit()`, `transform()`   | Learn patterns & transform data  | `StandardScaler`, `PCA`     |
| **Fit-only**     | `fit()`                  | Learn structure only             | `AgglomerativeClustering`   |


## Types of Predictors

Predictors are typically categorized into two main groups based on the kind of output they generate: **Regression** and **Classification**.

---

### 1. Regression Predictors 

A **regression predictor** (or regressor) is used when the target value is a continuous number. Its goal is to predict a numerical quantity.

* **Output**: A continuous numerical value (e.g., 10.5, 35000, -2.1).
* **Example Use Cases**: Predicting house prices, stock values, or temperature.
* **Method**: They use the `.predict()` method to return the predicted numerical value.

---

### 2. Classification Predictors 

A **classification predictor** (or classifier) is used when the target value is a discrete category or class. Its goal is to assign a label to an observation.

* **Output**: A discrete class label (e.g., `0` or `1`, 'Spam' or 'Not Spam', 'Cat' or 'Dog').
* **Example Use Cases**: Email spam detection, image recognition, or medical diagnosis.

Most classifiers offer two prediction methods:

* **`.predict()`**: Returns the **final, most likely class label** for each sample. For a binary problem, the output is one of the two labels seen during training (e.g., `0` or `1`). It gives a direct answer.

* **`.predict_proba()`**: Returns the **probability of the data point belonging to each class**. For a binary problem, each row looks like `[probability_of_class_0, probability_of_class_1]`, with columns ordered as in the fitted `classes_` attribute. This is useful for judging how confident the model is in its prediction. Not every classifier provides it; `SVC`, for example, exposes `predict_proba()` only when created with `probability=True`.
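
A short sketch contrasting the two methods. `LogisticRegression` and the synthetic data are assumptions chosen only because they keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)

labels = clf.predict(X[:3])       # hard class labels, shape (3,)
proba = clf.predict_proba(X[:3])  # one column per class, shape (3, 2)

print(clf.classes_)        # column order used by predict_proba
print(labels)
print(proba.sum(axis=1))   # each row sums to 1
```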

##  Custom Estimators

scikit-learn ships with many built-in **estimators** (predictors, transformers, and more).  
Sometimes, however, the algorithm you need is **not among them**.  
In that case, the solution is to create a **Custom Estimator**.

Custom estimators are **user-defined Python classes** that implement the scikit-learn estimator interface.  
They are designed to handle specific modeling or pre-processing needs that are not directly met by the built-in estimators.  
A custom estimator can be a small modification of an existing algorithm or an entirely new algorithm developed for a unique task.

Creating a custom estimator means defining a Python class that follows certain conventions and implements the required methods so it integrates smoothly with scikit-learn’s ecosystem, including **pipelines**, **cross-validation**, and **hyperparameter tuning**.

---

### Key Components of Custom Estimators

1. **Consistency with the Estimator Interface**  
   - At a minimum, a custom estimator must implement a **`fit()`** method.  
   - If it’s a transformer (for pre-processing), it should also implement a **`transform()`** method.  
   - If it’s a predictor (for modeling), it should implement a **`predict()`** method.

2. **Inheritance from Base Classes**  
   - Custom estimators usually inherit from **`BaseEstimator`** and, if needed,  
     **`ClassifierMixin`**, **`RegressorMixin`**, or **`TransformerMixin`**, depending on their purpose.  
   - This inheritance provides helpful methods and ensures compatibility with scikit-learn utilities.

3. **Parameters and Initialization**  
   - Parameters should be set through the constructor (**`__init__`** method), following scikit-learn’s pattern.  
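   - All parameters must be explicitly declared in **`__init__`** and stored, unchanged, as instance attributes of the same name.  
   - Anything learned from data belongs in **`fit()`** and is stored in attributes whose names end with an underscore (e.g., `coef_`).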

4. **Parameter Validation**  
   - It is good practice to validate parameters and inputs using scikit-learn utilities like  
     **`check_X_y`** for input validation and **`check_array`** for validating feature data.
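
The four points above can be sketched with a tiny regressor. `QuantileBaselineRegressor` and its `quantile` parameter are hypothetical names invented for this illustration. The mixin is listed before `BaseEstimator`, which is the order the scikit-learn developer guide recommends.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted


class QuantileBaselineRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical baseline: always predicts a fixed quantile of the training targets."""

    def __init__(self, quantile=0.5):
        # Only store the parameter; no validation or learning happens here
        self.quantile = quantile

    def fit(self, X, y):
        X, y = check_X_y(X, y)
        # Learned state is created in fit() and gets a trailing underscore
        self.value_ = np.quantile(y, self.quantile)
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        return np.full(X.shape[0], self.value_)


reg = QuantileBaselineRegressor(quantile=0.5)
print(reg.get_params())  # derived by BaseEstimator from the __init__ signature

X = np.arange(10).reshape(-1, 1)
y = np.arange(10, dtype=float)

# clone() and set_params() work because parameters are plain __init__ arguments
reg_copy = clone(reg).set_params(quantile=0.9)
preds = reg_copy.fit(X, y).predict(X[:2])
print(preds)
```

Because parameters are exposed through `get_params()` and `set_params()`, tools such as `GridSearchCV` can tune `quantile` without any extra code.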

###  Example: Simple Majority Classifier

Let’s build a **simple classification algorithm** as a custom estimator.

- During **training (`fit`)**:  
  The estimator checks the training data and finds the **majority class** (the class that appears most often).
  
- During **prediction (`predict`)**:  
  For any new input data, it always outputs the **same majority class** that was found in the training data.

This is a minimal but valid custom classifier because:
- It **learns** from data during `fit()` (by identifying the majority class).
- It **predicts** on new data during `predict()` (by returning that majority class).


###  Mixin Classes in Scikit-learn

**Mixin classes** in scikit-learn are auxiliary classes that provide **extra methods and functionality** to custom estimators.  
They are designed to be used through **multiple inheritance**, allowing a custom estimator to gain standardized capabilities  
(such as scoring, fitting, predicting, or transforming) **without rewriting these methods from scratch**.

---

####  Key Mixins and Their Purpose

1. **Scoring**  
   - `ClassifierMixin` and `RegressorMixin` provide a default implementation of the **`.score()`** method.  
   - Useful for quickly evaluating classification or regression estimators.

2. **Transformation**  
   - `TransformerMixin` adds the **`.fit_transform()`** method.  
   - Allows a transformer to **learn and apply a transformation in a single step**.

3. **Fitting and Predicting**  
   - `ClusterMixin` provides the **`.fit_predict()`** method.  
   - Enables clustering estimators to **fit the data and return cluster labels** in one operation.

---

Using these mixins ensures:
- **Consistency** with scikit-learn’s API.
- **Less boilerplate code**, since common functionality is inherited automatically.
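
As a small illustration of inherited functionality, the sketch below defines only `fit()` and `transform()` and still gets `fit_transform()` from `TransformerMixin`. `ColumnMeanCenterer` is a hypothetical class written for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnMeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that subtracts each column's training mean."""

    def fit(self, X, y=None):
        self.means_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.means_


X = np.array([[1.0, 4.0], [3.0, 8.0]])

# fit_transform() is not defined above; it is inherited from TransformerMixin
centered = ColumnMeanCenterer().fit_transform(X)
print(centered)
```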


In [179]:
from sklearn.base import BaseEstimator, ClassifierMixin
import numpy as np
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted

In [180]:
class MostFrequentClassClassifier(BaseEstimator, ClassifierMixin):
    """Baseline classifier that always predicts the most frequent training class."""

    def fit(self, X, y):
        # Validate input X and target vector y (y comes back as a 1D array)
        X, y = check_X_y(X, y)

        # Count each class and store the most frequent one.
        # Learned attributes are created here, not in __init__.
        self.classes_, counts = np.unique(y, return_counts=True)
        self.most_frequent_ = self.classes_[np.argmax(counts)]
        return self

    def predict(self, X):
        # Raises NotFittedError if fit() has not been called yet
        check_is_fitted(self)
        X = check_array(X)
        # Predict the most frequent class for each input sample
        return np.full(shape=(X.shape[0],), fill_value=self.most_frequent_)

In [181]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Initialize and fit the custom estimator
classifier = MostFrequentClassClassifier()
classifier.fit(X_train, y_train)

# Make predictions
predictions = classifier.predict(X_test)

# Evaluate the custom estimator
print(f"Predicted class for all test instances: {predictions[0]}")

Predicted class for all test instances: 1


In [182]:
classifier.most_frequent_

np.int64(1)

In [183]:
from sklearn.model_selection import cross_val_score

cross_val_score(classifier, X_train, y_train)

array([0.34782609, 0.26086957, 0.27272727, 0.18181818, 0.31818182])
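
For comparison, scikit-learn ships a built-in baseline with the same behavior, `DummyClassifier(strategy="most_frequent")`. The sketch below reuses the same Iris split and checks that every prediction equals the majority training label.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
predictions = baseline.predict(X_test)

# Every prediction equals the most common label in y_train
majority = np.bincount(y_train).argmax()
print(np.all(predictions == majority))
```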

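### Scoring function

`ClassifierMixin` already supplies a default `.score()` that returns mean accuracy. The explicit `score()` below therefore duplicates the inherited one; it is written out to show what that method does.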

In [185]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import accuracy_score
from sklearn.utils.validation import check_is_fitted
import numpy as np

class MostFrequentClassClassifier(BaseEstimator, ClassifierMixin):
    def fit(self, X, y):
        # Ensure y is 1D
        y = np.ravel(y)

        # Compute the most frequent class (learned state is created in fit, not __init__)
        self.classes_, counts = np.unique(y, return_counts=True)
        self.most_frequent_ = self.classes_[np.argmax(counts)]
        return self

    def predict(self, X):
        # Raises NotFittedError if fit() has not been called yet
        check_is_fitted(self)
        # Predict the most frequent class for each input sample
        return np.full(shape=(X.shape[0],), fill_value=self.most_frequent_)

    def score(self, X, y):
        """Return the mean accuracy on the given test data and labels."""
        # Ensure y is 1D
        y = np.ravel(y)

        # Generate predictions
        predictions = self.predict(X)

        # Calculate and return the accuracy
        return accuracy_score(y, predictions)

In [186]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load a dataset
iris = load_iris()
X, y = iris.data, iris.target

# Simplify to a binary classification problem
is_class_0_or_1 = y < 2
X_bin = X[is_class_0_or_1]
y_bin = y[is_class_0_or_1]

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X_bin, y_bin, test_size=0.2, random_state=42)

# Initialize and fit the custom classifier
classifier = MostFrequentClassClassifier()
classifier.fit(X_train, y_train)

# Evaluate the classifier using the score method
score = classifier.score(X_test, y_test)
print(f"Accuracy of the MostFrequentClassClassifier: {score}")


Accuracy of the MostFrequentClassClassifier: 0.4


## Transformers

In scikit-learn, a **transformer** is a specific type of **estimator** designed to **transform datasets**.  
Transformers are primarily used for **pre-processing data**, enabling machine learning models to perform better by providing well-prepared input.  
The main goal of a transformer is to **modify or create new features** from the original dataset, making it more suitable for modeling.

---

###  Key Tasks of Transformers
- **Scaling numerical features**: Standardizing or normalizing features to a common scale.  
  *Example: `StandardScaler`, `MinMaxScaler`*  
- **Encoding categorical variables**: Converting categories into numeric representations.  
  *Example: `OneHotEncoder`, `OrdinalEncoder`*  
- **Handling missing values**: Filling in or imputing missing data.  
  *Example: `SimpleImputer`*  
- **Feature extraction or creation**: Generating new features from existing ones to capture important patterns.  
  *Example: `PCA`, `PolynomialFeatures`*

---

###  Key Methods
- **`fit()`** → Learns parameters from the training data (e.g., mean, variance, encoding map).  
- **`transform()`** → Applies the learned transformation to data.  
- **`fit_transform()`** → Combines both fit and transform in a single step for convenience.  

> Transformers **learn from data** but do not make predictions. They focus purely on preparing or modifying data for downstream modeling.

---

### Examples
- `StandardScaler`  
- `MinMaxScaler`  
- `OneHotEncoder`  
- `PCA`  

---

### Advantages
- Standardized and consistent **API** across scikit-learn.  
- Easily integrated into **pipelines** for automated workflows.  
- Reduces the risk of data leakage: `fit()` is called on the training data only, and the learned parameters are then reused by `transform()` on validation or test data.  
- Can handle a wide range of pre-processing tasks efficiently.
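
The leakage point can be sketched as follows: the scaler is fitted on the training split only and then reused on the test split. The random data and the choice of `MinMaxScaler` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # hypothetical data
X_train, X_test = train_test_split(X, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # min/max learned from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those values; never refit on test data

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))  # 0 and 1 per column
# Test values may fall slightly outside [0, 1]; that is expected
print(X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```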

---

### Main Disadvantages
1. By default, transformers process **all features passed to them**, so if you want to transform only a subset of features, you need to manually select them.  
2. Cannot cover **every possible transformation use case**, so custom transformers may be needed for specialized tasks.

---

Transformers are essential in **feature engineering and preprocessing**, ensuring that data is in a suitable format for machine learning models.  
By learning from the training data and applying consistent transformations to new data, transformers help maintain **data integrity** and improve model performance.


In [178]:
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Generate some data
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=42)

# Use the transformer directly
X_transformed = StandardScaler().fit_transform(X)


LinearRegression().fit(X_transformed, y)


LinearRegression()


## Custom Transformers in Scikit-learn

Not every transformation you need is available in scikit-learn.  
When a built-in transformer does not meet your requirements, you can create a **custom transformer**.

---

###  What is a Custom Transformer?
A **custom transformer** is a user-defined class that implements the scikit-learn transformer interface.  
It allows you to:
- Apply transformations not provided by default transformers.
- Preprocess or modify data in a specific way for your problem.
- Integrate seamlessly with **pipelines**, **cross-validation**, and **model selection tools**.

---

###  Key Requirements
1. Must implement a **`fit()`** method to learn from training data.  
2. Must implement a **`transform()`** method to apply the transformation.  
3. **`fit_transform()`** is inherited from `TransformerMixin`; override it only when fitting and transforming in one pass can be done more efficiently.  
4. Inherit from **`BaseEstimator`** and **`TransformerMixin`** for compatibility with scikit-learn utilities.  

---

###  Example Use Cases
- Custom scaling or normalization logic.  
- Domain-specific feature extraction.  
- Combining multiple preprocessing steps into a single transformer.


##  Custom Transformers

There are **two ways** to build a custom transformer in scikit-learn:  
1. **Functional approach**  
2. **Class approach**

---

###  Functional Approach
- Used when the transformer **does not need to learn from data**.  
- Creates **stateless transformers**: `fit()` still exists for API compatibility, but it learns nothing from the data.  
- Suitable for transformations that **do not depend on training data**, e.g.:  
  - Squaring values (`x^2`)  
  - Cubing values (`x^3`)  
  - Reciprocal (`1/x`)  
  - Logarithm (`log(x)`)  
  - Trigonometric functions (`cos(x)`, `sin(x)`)  

> These are called **stateless transforms** because they do not store any learned parameters.

- In scikit-learn, the **`FunctionTransformer`** can be used to wrap such functions and create new custom stateless transformers.  

- Example use case: A transformer that computes the **cube of feature values**.
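
Besides the cube example that follows, here is a minimal sketch of a stateless log transform. Passing `inverse_func` is optional; it is included here, as an assumption about what a reader might want, to show that the transformation can be undone.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# log1p(x) = log(1 + x); expm1 is its exact inverse
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)

X = np.array([[0.0, 9.0], [99.0, 999.0]])
X_log = log_transformer.fit_transform(X)           # fit() learns nothing here
X_back = log_transformer.inverse_transform(X_log)  # undo the transformation

print(X_log)
print(np.allclose(X, X_back))  # True
```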


### Custom Transformer using FunctionTransformer

In [239]:
import numpy as np

def cube(x):
    # Element-wise cube; works on scalars and NumPy arrays
    return np.power(x, 3)

In [240]:
from sklearn.preprocessing import FunctionTransformer

# Create the custom transformer
cube_transformer = FunctionTransformer(cube)

In [241]:
from sklearn.datasets import make_regression
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Generate some data
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=42)

# Use the transformer directly
X_transformed = cube_transformer.transform(X)

LinearRegression().fit(X_transformed, y)


LinearRegression()


### Custom Transformer using BaseEstimator and TransformerMixin

### Class Approach

In scikit-learn, the **class approach** is used to create **stateful custom transformers** that **learn from the data** before applying transformations.  
Unlike the functional approach, where transformers are stateless, the class approach **stores parameters or metrics** derived from the training data, which are then used to transform new data consistently.

---

### Key Characteristics
1. **Stateful Transformers**  
   - Maintain internal state by learning statistics or parameters from the training data.  
   - Examples: mean, median, standard deviation, interquartile range (IQR), category mappings, etc.

2. **Separation of Fit and Transform**  
   - **fit()**: Computes and stores required parameters from the training data.  
   - **transform()**: Applies the learned transformation using the stored parameters.  
   - Ensures the same transformation can be applied consistently to new, unseen data.

3. **Inheritance for Compatibility**  
   - Typically inherit from **BaseEstimator** and **TransformerMixin**.  
   - Ensures compatibility with scikit-learn pipelines, cross-validation, and hyperparameter tuning.

4. **Custom Logic for Data-Dependent Transformation**  
   - Any transformation that depends on data statistics requires the class approach.  
   - Examples: scaling based on median/IQR, normalizing based on standard deviation, or encoding based on observed categories.

---

### Example Concept: MedianIQRScaler
- Suppose a numeric column `f1` has **median = 30** and **IQR = 45**.  
- The transformation for each value is calculated as:

  `(value - median) / IQR`

  Examples:  
  - For 36: `(36 - 30) / 45 = 0.133`  
  - For 38: `(38 - 30) / 45 = 0.178`

- Since this transformer **needs to compute median and IQR from the training data**, it is a **stateful transformer**, and the **class approach** is appropriate.
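
The arithmetic above can be checked with scikit-learn's built-in `RobustScaler`, which by default centers on the median and scales by the 25th to 75th percentile range. The five training values are hypothetical, chosen so that the median is 30 and the IQR is 45.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Hypothetical training column: median = 30, Q1 = 7.5, Q3 = 52.5, so IQR = 45
f1_train = np.array([[0.0], [7.5], [30.0], [52.5], [60.0]])

scaler = RobustScaler()
scaler.fit(f1_train)
print(scaler.center_, scaler.scale_)  # learned median and IQR

new_values = np.array([[36.0], [38.0]])
scaled = scaler.transform(new_values).ravel()
print(scaled)  # (36 - 30) / 45 and (38 - 30) / 45
```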

---

### When to Use the Class Approach
- For transformations **dependent on data-derived statistics**.  
- When the transformer must **store learned parameters** and **apply them consistently** to new data.  
- Examples include:  
  - Median/IQR scaling  
  - Standardization using mean and standard deviation  
  - Target encoding based on category statistics  
  - Custom normalization or feature engineering using training data metrics

> In short, the class approach is used for **data-dependent, stateful transformations** that ensure consistency and reproducibility in machine learning workflows.

In [245]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted
import numpy as np

In [246]:
class MedianIQRScaler(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Calculate medians and interquartile range for each feature.
        # Learned attributes are created here, not in __init__.
        self.medians_ = np.median(X, axis=0)
        Q1 = np.percentile(X, 25, axis=0)
        Q3 = np.percentile(X, 75, axis=0)
        self.iqr_ = Q3 - Q1

        # Handle case where IQR is 0 to avoid division by zero during transform
        self.iqr_[self.iqr_ == 0] = 1
        return self

    def transform(self, X):
        # Raises NotFittedError if fit() has not been called yet
        check_is_fitted(self)

        # Scale features using median and IQR learned during fit
        return (X - self.medians_) / self.iqr_


In [247]:
from sklearn.datasets import make_blobs

# Generate synthetic data
X, _ = make_blobs(n_samples=100, n_features=2, centers=3, random_state=42)

# Initialize the transformer
scaler = MedianIQRScaler()

# Fit the scaler to the data
scaler.fit(X)

# Transform the data
X_scaled = scaler.transform(X)

# Check the first few rows of the transformed data
print("Transformed data (first 5 rows):")
print(X_scaled[:5])


Transformed data (first 5 rows):
[[-0.49872679 -0.71613207]
 [ 0.78423675 -0.08192868]
 [-0.03656645  0.52987512]
 [ 0.84159877 -0.09379661]
 [-0.3814692  -0.57206564]]


##  Composite Transformers in Scikit-learn

A **composite transformer** is a transformer built by combining **multiple transformers or estimators**.  
It allows you to **apply a series of transformations or processing steps** in a structured and organized way.

Composite transformers are especially useful when:
- Different transformations are needed for different columns or features.  
- You want to combine multiple preprocessing steps into a **single reusable workflow**.  
- You need to integrate transformations into **pipelines** for modeling and evaluation.

---

###  Types of Composite Transformers

1. **ColumnTransformer**  
   - Allows applying **different transformers to different columns** of a dataset.  
   - Useful for datasets with **mixed types of features** (numerical, categorical, text).  
   - Ensures that each column is processed appropriately while keeping the transformed output combined.

2. **Pipeline**  
   - Chains multiple transformers and a final estimator into a **sequential workflow**.  
   - Each step receives the output of the previous step.  
   - Simplifies repetitive preprocessing and modeling tasks, and ensures **consistent data flow**.

3. **FeatureUnion**  
   - Combines the output of **multiple transformers applied in parallel** into a single feature matrix.  
   - Useful when you want to **extract different features independently** and then merge them for downstream modeling.

---

###  Advantages of Composite Transformers
- Provides **modular and reusable workflows** for complex preprocessing tasks.  
- Maintains **consistency and reproducibility** across different datasets.  
- Fully compatible with scikit-learn **pipelines, cross-validation, and hyperparameter tuning**.  
- Makes it easier to **manage multiple transformations** in one place.

##  ColumnTransformer

The **ColumnTransformer** is a feature transformer in scikit-learn that allows you to **apply different transformations to different columns** or subsets of columns in a dataset.  
It then **concatenates the results** of all transformers into a single feature matrix.

This is especially useful for datasets with **mixed types of features**, such as:
- Numerical features that require scaling.  
- Categorical features that require encoding.  
- Other features that may need custom preprocessing.

---

###  What Can Be Used Inside a ColumnTransformer?
- **Transformers**: Built-in scikit-learn transformers like `StandardScaler`, `MinMaxScaler`, `OneHotEncoder`, etc.  
- **Function Transformers**: Stateless transformations using `FunctionTransformer`.  
- **Custom Transformers**: User-defined transformers built with the class approach.  
- **Pipelines**: Sequential workflows of transformers and estimators.  
- **Feature Unions**: Parallel combination of multiple transformers.
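
As a sketch of mixing these building blocks, the example below places a `Pipeline`, a `FunctionTransformer`, and a built-in encoder inside one `ColumnTransformer`. The DataFrame, its column names, and the step names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical mixed-type data
df = pd.DataFrame({
    "age": [21.0, np.nan, 30.0, 19.0],
    "income": [1000.0, 2500.0, 4000.0, 1500.0],
    "platform": ["Twitter", "Facebook", "Twitter", "Instagram"],
})

# A Pipeline used as a single step: impute missing ages, then scale them
age_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

preprocessor = ColumnTransformer([
    ("age", age_pipeline, ["age"]),
    ("income_log", FunctionTransformer(np.log1p), ["income"]),
    ("platform", OneHotEncoder(handle_unknown="ignore"), ["platform"]),
])

features = preprocessor.fit_transform(df)
print(features.shape)  # (4, 5): 1 age + 1 income + 3 platform columns
```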

---

###  Advantages
- Allows **different preprocessing for different columns** in a single step.  
- Keeps transformations **organized and modular**.  
- Fully compatible with scikit-learn **pipelines, cross-validation, and grid search**.  
- Reduces the risk of **data leakage** when used inside a pipeline, because every column-specific transformer is fitted on the training data only.


In [272]:
import numpy as np
import pandas as pd

# Define the data with numeric labels for sentiment
data = {
    "Social Media Platform": ["Twitter", "Facebook", "Instagram", "Twitter", "Facebook",
                              "Instagram", "Twitter", "Facebook", "Instagram", "Twitter"],
    "Review": ["Love the new update!", "Too many ads now", "Great for sharing photos",
               "Newsfeed algorithm is biased", "Privacy concerns with latest update",
               "Amazing filters!", "Too much spam", "Easy to connect with friends",
               "Stories feature is fantastic", "Customer support lacking"],
    "age": [21, 19, np.nan, 17, 24, np.nan, 30, 19, 16, 31],
    "Sentiment": [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]  # Numeric labels: 1 for Positive, 0 for Negative
}

# Create a DataFrame
df = pd.DataFrame(data)

print(df)

  Social Media Platform                               Review   age  Sentiment
0               Twitter                 Love the new update!  21.0          1
1              Facebook                     Too many ads now  19.0          0
2             Instagram             Great for sharing photos   NaN          1
3               Twitter         Newsfeed algorithm is biased  17.0          0
4              Facebook  Privacy concerns with latest update  24.0          0
5             Instagram                     Amazing filters!   NaN          1
6               Twitter                        Too much spam  30.0          0
7              Facebook         Easy to connect with friends  19.0          1
8             Instagram         Stories feature is fantastic  16.0          1
9               Twitter             Customer support lacking  31.0          0


In [273]:
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

In [274]:
# Define the column transformer
column_transformer = ColumnTransformer(
    transformers=[
        ('platform_ohe', OneHotEncoder(), ['Social Media Platform']),
        ('review_bow', CountVectorizer(), 'Review'),
        ('age_impute', SimpleImputer(),['age'])
    ],
    remainder='drop'  # Drop other columns not specified in transformers
)

In [275]:
pd.DataFrame(column_transformer.fit_transform(df).toarray(),columns=column_transformer.get_feature_names_out())

| | platform_ohe__Social Media Platform_Facebook | platform_ohe__Social Media Platform_Instagram | platform_ohe__Social Media Platform_Twitter | review_bow__ads | review_bow__algorithm | review_bow__amazing | review_bow__biased | review_bow__concerns | review_bow__connect | review_bow__customer | ... | review_bow__sharing | review_bow__spam | review_bow__stories | review_bow__support | review_bow__the | review_bow__to | review_bow__too | review_bow__update | review_bow__with | age_impute__age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 21.0 |
| 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 19.0 |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 22.125 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 17.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 24.0 |
| 5 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 22.125 |
| 6 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 30.0 |
| 7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 19.0 |
| 8 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 16.0 |
| 9 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 31.0 |


##  FeatureUnion

**FeatureUnion** is a composite transformer that allows you to **apply multiple transformations in parallel** on the same dataset and then **combine the results into a single feature matrix**.  
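It is related to pipelines and column transformers, with two key differences: a `Pipeline` chains its steps sequentially, and a `ColumnTransformer` routes selected columns to each transformer, whereas a `FeatureUnion` feeds the **full input** to every transformer side by side.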

---

###  How FeatureUnion Works

- In a **Pipeline**, transformations are applied **sequentially**: each step depends on the output of the previous step.  
- In **FeatureUnion**, transformations are applied **independently** to the same input data (and concurrently if `n_jobs` is set).  
- After all transformers are applied, their outputs are **concatenated horizontally** to form a single feature matrix.

**Example Concept:**
- Suppose we have three numerical features: `f1`, `f2`, `f3`.  
- We want to apply two transformations **in parallel**:  
  1. `StandardScaler` → outputs 3 scaled features  
  2. `PCA` with 2 components → outputs 2 principal components  

- **FeatureUnion** applies both transformations **at the same time** to the input dataset.  
- The results are then **combined horizontally**, giving a total of **5 features** (3 from StandardScaler + 2 from PCA).

---

###  Key Characteristics

1. **Parallel Processing**
   - All transformers in a FeatureUnion **receive the same input data**.  
   - Each transformer produces its own output independently.

2. **Horizontal Concatenation**
   - Outputs of all transformers are joined **side by side** to form the final feature matrix.

3. **Cannot Select Specific Columns**
   - Unlike `ColumnTransformer`, FeatureUnion **does not allow assigning transformers to specific columns**.  
   - Every transformer is applied to the **entire input dataset**.

4. **Useful for Multiple Feature Extraction Techniques**
   - For example, consider text preprocessing:  
     - Bag-of-words → 50 features  
     - Word count → 1 feature  
     - Count of vowels → 1 feature  
   - FeatureUnion will apply all three transformations **in parallel** and combine the results → total 52 features.

5. **Integration with Pipelines**
   - FeatureUnion can be used inside a pipeline, allowing you to **combine parallel transformations with sequential preprocessing steps**.
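
The text-feature scenario from point 4 can be sketched as follows. The helper functions `word_count` and `vowel_count`, the step names, and the three sample sentences are hypothetical; with such a tiny corpus the bag-of-words part yields 10 columns rather than 50.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer


def word_count(texts):
    # One column: number of whitespace-separated words per document
    return np.array([[len(t.split())] for t in texts])


def vowel_count(texts):
    # One column: number of vowels per document
    return np.array([[sum(ch in "aeiou" for ch in t.lower())] for t in texts])


texts = ["Love the new update", "Too many ads now", "Amazing filters"]

union = FeatureUnion([
    ("bow", CountVectorizer()),                       # bag-of-words counts
    ("n_words", FunctionTransformer(word_count)),     # stateless feature
    ("n_vowels", FunctionTransformer(vowel_count)),   # stateless feature
])

# Each transformer sees the full input; outputs are stacked side by side
features = union.fit_transform(texts)
print(features.shape)  # (3, 12): 10 vocabulary words + 2 hand-made features
```

Because `CountVectorizer` returns a sparse matrix, the combined result is sparse as well; call `.toarray()` to inspect it.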

---

###  Advantages of FeatureUnion
- Enables **parallel feature extraction** in a modular way; setting `n_jobs` can additionally run the transformers concurrently.  
- Combines **multiple transformations into a single feature matrix** automatically.  
- Ensures that different types of feature extraction or transformation can be applied **simultaneously** without manual concatenation.  
- Works seamlessly with scikit-learn pipelines, making workflows **clean, reproducible, and maintainable**.

---

###  When to Use FeatureUnion
- When you want to **apply multiple transformations independently** to the same dataset.  
- When combining **different feature extraction techniques**, such as text, numerical, or image features.  
- When you want to **merge multiple outputs horizontally** to create a richer feature set for machine learning models.


In [291]:
import pandas as pd
import numpy as np

# Generating a random dataset with 10 rows and 4 columns
np.random.seed(42)  # For reproducibility
data = np.random.randn(10, 4)

# Creating a DataFrame and naming the columns
df = pd.DataFrame(data, columns=['f1', 'f2', 'f3', 'y'])

df

| | f1 | f2 | f3 | y |
|---|---|---|---|---|
| 0 | 0.496714 | -0.138264 | 0.647689 | 1.52303 |
| 1 | -0.234153 | -0.234137 | 1.579213 | 0.767435 |
| 2 | -0.469474 | 0.54256 | -0.463418 | -0.46573 |
| 3 | 0.241962 | -1.91328 | -1.724918 | -0.562288 |
| 4 | -1.012831 | 0.314247 | -0.908024 | -1.412304 |
| 5 | 1.465649 | -0.225776 | 0.067528 | -1.424748 |
| 6 | -0.544383 | 0.110923 | -1.150994 | 0.375698 |
| 7 | -0.600639 | -0.291694 | -0.601707 | 1.852278 |
| 8 | -0.013497 | -1.057711 | 0.822545 | -1.220844 |
| 9 | 0.208864 | -1.95967 | -1.328186 | 0.196861 |


In [292]:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

In [293]:
# Define FeatureUnion
feature_union = FeatureUnion([
    ('scaler', StandardScaler()),  # Apply StandardScaler
    ('pca', PCA(n_components=2))   # Apply PCA, reduce to 2 components
])

In [294]:
X_transformed = feature_union.fit_transform(df.drop(columns=['y']))

pd.DataFrame(X_transformed, columns=feature_union.get_feature_names_out())

|   | scaler__f1 | scaler__f2 | scaler__f3 | pca__pca0 | pca__pca1 |
|---|---|---|---|---|---|
| 0 | 0.815293 | 0.41836 | 0.947878 | 1.025659 | -0.425413 |
| 1 | -0.282292 | 0.302777 | 1.873701 | 1.772532 | -0.358223 |
| 2 | -0.635686 | 1.239158 | -0.156427 | 0.327888 | 1.038742 |
| 3 | 0.432718 | -1.721587 | -1.410206 | -1.911072 | -0.68996 |
| 4 | -1.451676 | 0.963905 | -0.598312 | -0.193153 | 1.371662 |
| 5 | 2.270396 | 0.312856 | 0.371269 | 0.51176 | -0.891133 |
| 6 | -0.74818 | 0.718778 | -0.839795 | -0.48428 | 1.020731 |
| 7 | -0.832663 | 0.233387 | -0.29387 | -0.191723 | 0.583958 |
| 8 | 0.04908 | -0.690119 | 1.121664 | 0.726878 | -0.811461 |
| 9 | 0.383011 | -1.777515 | -1.015903 | -1.584488 | -0.838903 |


### Pipeline

A **Pipeline** in scikit-learn is a **composite transformer/estimator** that chains together a sequence of operations (transformers and/or predictors) and applies them **sequentially** to the entire dataset.  
It is one of the most important tools in scikit-learn for building clean, reproducible, and production-ready machine learning workflows.

---

###  Purpose of a Pipeline
- **Organize complex workflows** into a single, manageable object.
- Ensure that **all transformations** (scaling, encoding, feature selection, etc.) are applied in the **correct order** every time.
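- Prevent **data leakage**: when the pipeline is fitted on the training set (or on each training fold during cross-validation), every preprocessing step learns its parameters from that data only.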
- Enable **parameter tuning** across the entire workflow using tools like `GridSearchCV` or `RandomizedSearchCV`.
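
As a rough illustration of the leakage point, the sketch below passes a whole pipeline to `cross_val_score`. The synthetic dataset from `make_classification` and the step names are assumptions made for the example. Because the scaler lives inside the pipeline, it is refitted on the training portion of every fold and never sees the held-out portion during fitting:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),      # fitted per training fold
    ("clf", LogisticRegression()),
])

# Each of the 5 folds clones the pipeline, fits it on the training part,
# and scores it on the held-out part
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.shape)  # (5,)
```

Scaling `X` once up front and then cross-validating only the classifier would let statistics from the held-out folds influence the scaler, which is the leakage the pipeline avoids.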

---

###  How It Works
A pipeline works like a chain of steps. Each step is either:
- A **Transformer**: An object that implements `fit()` and `transform()`.
- The **Final Estimator**: An object that implements `fit()` and (optionally) `predict()`.

The pipeline calls each step in sequence:
1. The **first transformer** receives the raw input data and applies its transformation.
2. The **next transformer** takes the transformed output from the previous step and continues processing.
3. This continues until the **final estimator**, which produces predictions or final outputs.
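
The sequence above can be sketched by chaining the steps by hand and comparing the result with a `Pipeline`. This is a minimal illustration on assumed synthetic data with a two-step chain (scaler, then classifier), not the notebook's own example:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Manual chaining: fit the transformer, pass its output to the estimator
scaler = StandardScaler()
clf = LogisticRegression()
X_scaled = scaler.fit_transform(X)
clf.fit(X_scaled, y)
manual_pred = clf.predict(scaler.transform(X))

# The same chain expressed as a Pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
pipe_pred = pipe.predict(X)

# Both routes run the same deterministic computations
print(np.array_equal(manual_pred, pipe_pred))
```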

---

#### Example Flow
Suppose we have three features `f1`, `f2`, `f3` and we want to:
1. **Scale** the data using `StandardScaler`.
2. **Reduce dimensions** using `PCA` to 2 components.
3. **Classify** using `LogisticRegression`.

Pipeline Steps:
1. Raw Data → **StandardScaler** →  
2. Scaled Data → **PCA (2 components)** →  
3. Reduced Data → **Logistic Regression → Predictions**  

The final output contains **predictions** (or transformed features if the last step is a transformer rather than a predictor).
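
One way to write this flow in code is sketched below. The random data, the rule used to create the binary target `y`, and the step names are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Three features f1, f2, f3 and a made-up binary target
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = (X["f1"] + X["f2"] > 0).astype(int)

flow = Pipeline([
    ("scaler", StandardScaler()),          # step 1: scale
    ("pca", PCA(n_components=2)),          # step 2: reduce to 2 components
    ("classifier", LogisticRegression()),  # step 3: classify
])
flow.fit(X, y)

predictions = flow.predict(X)      # output of the final step
reduced = flow[:-1].transform(X)   # output just before the classifier

print(predictions.shape)  # (100,)
print(reduced.shape)      # (100, 2)
```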

---

###  Key Characteristics
- **Sequential Processing**: Each step receives the **output of the previous step** as input.
- **Entire Dataset**: Each transformer operates on the entire dataset passed from the previous step.
- **Single API**: The pipeline behaves like one estimator. You can call:
  - `fit()` to train all steps in sequence.
  - `transform()` to apply all transformations (available only when the final step is itself a transformer).
  - `predict()` if the final step is a predictor.
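
The sketch below illustrates how the methods a pipeline exposes depend on its final step. It assumes a recent scikit-learn release, in which `Pipeline` only exposes `transform()` when the last step has one; the data and step names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=60, n_features=4, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)

# The final step is a predictor, so predict() is exposed but transform() is not
print(hasattr(pipe, "predict"))    # True
print(hasattr(pipe, "transform"))  # False

# Slicing off the last step gives a transformer-only pipeline
preprocessing = pipe[:-1]
X_scaled = preprocessing.transform(X)
print(X_scaled.shape)  # (60, 4)
```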

---

###  Advantages of Pipelines
- **Consistency**: Guarantees that the exact same transformations are applied to both training and testing data.
- **Cleaner Code**: Reduces boilerplate by combining all steps into one object.
- **Prevents Data Leakage**: Ensures that transformations like scaling or imputation are fitted only on training data and applied to test data consistently.
- **Model Selection**: Works seamlessly with `GridSearchCV` and `RandomizedSearchCV` for hyperparameter tuning across both preprocessing and modeling steps.
- **Reproducibility**: The entire workflow can be saved, loaded, and reused with the same configuration.
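
To illustrate the model-selection point, here is a small sketch of tuning preprocessing and model parameters together. Parameters of a step are addressed as `<step name>__<parameter>`; the dataset, the grid values, and the step names are assumptions chosen for the example:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression()),
])

# One grid covers both a preprocessing step and the classifier
param_grid = {
    "pca__n_components": [2, 4],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```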

---

###  Difference from FeatureUnion
| Aspect                  | **Pipeline**                             | **FeatureUnion**                           |
|--------------------------|--------------------------------------------|----------------------------------------------|
| Processing Direction     | Sequential (step-by-step)                 | Parallel (all transformers at once)         |
| Input to Transformers    | Output of the previous step               | Same original dataset for all transformers   |
| Output Feature Space     | From the final step only                  | Horizontal concatenation of all outputs      |
| Use Case                 | When transformations must happen **in order** | When independent features need to be combined |
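
The difference in output feature space can be checked with a short sketch. It applies the same two transformers once sequentially and once in parallel to an assumed random 10×3 matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(10, 3))

# Sequential: PCA receives the scaled data; only the last step's output remains
sequential = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

# Parallel: both transformers receive X; outputs are concatenated column-wise
parallel = FeatureUnion([
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

seq_out = sequential.fit_transform(X)
par_out = parallel.fit_transform(X)

print(seq_out.shape)  # (10, 2)
print(par_out.shape)  # (10, 5)
```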

---

###  Best Practices
- The **final step** is usually a predictor (e.g., `LogisticRegression`, `RandomForestClassifier`), but it can also be a transformer if you only need transformed features.
- Use **ColumnTransformer inside a Pipeline** to handle datasets with mixed types (numerical + categorical).
- Always fit pipelines on **training data only**, then use the same pipeline to transform and predict on test data.
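
A minimal sketch of these practices on mixed-type data follows. The toy DataFrame, its column names, and the use of `handle_unknown="ignore"` are assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy dataset with one numerical and one categorical column
df = pd.DataFrame({
    "age": [21, 19, 35, 17, 24, 42, 30, 19, 16, 31, 28, 50],
    "platform": ["A", "B", "C"] * 4,
    "label": [1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0],
})
X = df[["age", "platform"]]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Column-specific preprocessing wrapped inside the pipeline
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["platform"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])

pipe.fit(X_train, y_train)    # fit on training data only
preds = pipe.predict(X_test)  # reuse the fitted steps on test data
print(preds.shape)  # (3,)
```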

---
 
A **Pipeline** is ideal when you need to apply **multiple preprocessing steps and a model in a strict order**.  
It simplifies model building, reduces errors, and ensures a **reliable, production-ready workflow**.


In [310]:
import pandas as pd
import numpy as np

# Generating a random dataset with 10 rows and 4 columns
np.random.seed(42)  # For reproducibility
data = np.random.randn(10, 4)

# Creating a DataFrame and naming the columns
df = pd.DataFrame(data, columns=['f1', 'f2', 'f3', 'y'])

df

|   | f1 | f2 | f3 | y |
|---|---|---|---|---|
| 0 | 0.496714 | -0.138264 | 0.647689 | 1.52303 |
| 1 | -0.234153 | -0.234137 | 1.579213 | 0.767435 |
| 2 | -0.469474 | 0.54256 | -0.463418 | -0.46573 |
| 3 | 0.241962 | -1.91328 | -1.724918 | -0.562288 |
| 4 | -1.012831 | 0.314247 | -0.908024 | -1.412304 |
| 5 | 1.465649 | -0.225776 | 0.067528 | -1.424748 |
| 6 | -0.544383 | 0.110923 | -1.150994 | 0.375698 |
| 7 | -0.600639 | -0.291694 | -0.601707 | 1.852278 |
| 8 | -0.013497 | -1.057711 | 0.822545 | -1.220844 |
| 9 | 0.208864 | -1.95967 | -1.328186 | 0.196861 |


In [311]:
from sklearn.pipeline import Pipeline

In [312]:
# Define Pipeline: scale first, then reduce the scaled data with PCA
pipeline = Pipeline([
    ('scaler', StandardScaler()),  # Apply StandardScaler
    ('pca', PCA(n_components=2))
])

In [313]:
# X is assumed to be a feature matrix defined in an earlier cell, not the 10-row df above
pd.DataFrame(pipeline.fit_transform(X), columns=pipeline.get_feature_names_out())

|   | pca0 | pca1 |
|---|---|---|
| 0 | -1.967289 | -0.189197 |
| 1 | 0.980541 | -1.140597 |
| 2 | 0.696192 | 1.069669 |
| 3 | 1.049030 | -1.246446 |
| 4 | -1.562347 | -0.140628 |
| ... | ... | ... |
| 95 | 0.673056 | 1.107907 |
| 96 | 0.960615 | 0.890179 |
| 97 | 1.014953 | -0.817791 |
| 98 | -1.401202 | -0.194052 |


### Slightly Complex Example

In [314]:
import numpy as np  # needed for np.nan below
import pandas as pd

# Define the data with numeric labels for sentiment
data = {
    "Social Media Platform": ["Twitter", "Facebook", "Instagram", "Twitter", "Facebook",
                              "Instagram", "Twitter", "Facebook", "Instagram", "Twitter"],
    "Review": ["Love the new update!", "Too many ads now", "Great for sharing photos",
               "Newsfeed algorithm is biased", "Privacy concerns with latest update",
               "Amazing filters!", "Too much spam", "Easy to connect with friends",
               "Stories feature is fantastic", "Customer support lacking"],
    "age": [21, 19, np.nan, 17, 24, np.nan, 30, 19, 16, 31],
    "Sentiment": [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]  # Numeric labels: 1 for Positive, 0 for Negative
}

# Create a DataFrame
df = pd.DataFrame(data)

print(df)

  Social Media Platform                               Review   age  Sentiment
0               Twitter                 Love the new update!  21.0          1
1              Facebook                     Too many ads now  19.0          0
2             Instagram             Great for sharing photos   NaN          1
3               Twitter         Newsfeed algorithm is biased  17.0          0
4              Facebook  Privacy concerns with latest update  24.0          0
5             Instagram                     Amazing filters!   NaN          1
6               Twitter                        Too much spam  30.0          0
7              Facebook         Easy to connect with friends  19.0          1
8             Instagram         Stories feature is fantastic  16.0          1
9               Twitter             Customer support lacking  31.0          0


In [315]:
def count_words(reviews):
    # Count the number of words in each review
    # Assuming reviews is a 1D array-like of text strings
    return np.array([len(review.split()) for review in reviews]).reshape(-1, 1)

In [316]:
from sklearn.preprocessing import FunctionTransformer

# Create the FunctionTransformer using the count_words function
word_count_transformer = FunctionTransformer(count_words)
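
As a quick check of what this transformer produces, the sketch below applies it to two of the review strings from the DataFrame above. Passing a plain Python list here is an assumption for brevity; in the pipeline the transformer receives the `Review` column instead:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def count_words(reviews):
    # One word count per review, returned as a column vector
    return np.array([len(review.split()) for review in reviews]).reshape(-1, 1)

word_count_transformer = FunctionTransformer(count_words)

sample = ["Love the new update!", "Too much spam"]
counts = word_count_transformer.fit_transform(sample)

print(counts.shape)    # (2, 1)
print(counts.ravel())  # [4 3]
```

The `reshape(-1, 1)` matters because `FeatureUnion` stacks outputs column-wise, so each transformer should return a 2D array with one row per sample.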

In [317]:
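from sklearn.feature_extraction.text import CountVectorizer

# Two parallel views of the review text: a word count and a bag-of-words matrix
feature_union = FeatureUnion([
    ('word_count', word_count_transformer),
    ('bag_of_words', CountVectorizer())
])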

In [318]:
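from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# 'Review' is passed as a string (not a list) so the text steps receive a 1D column
column_transformer = ColumnTransformer(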
    transformers=[
        ('age_imputer', SimpleImputer(strategy='mean'), ['age']),
        ('platform_ohe', OneHotEncoder(), ['Social Media Platform']),
        ('review_processing', feature_union, 'Review')
    ],
    remainder='drop'  # Drop other columns not specified here
)

In [319]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MaxAbsScaler
from sklearn.feature_selection import SelectKBest, chi2

final_pipeline = Pipeline(steps=[
    ('col_transformer', column_transformer),
    ('scaler', MaxAbsScaler()),
    ('selector', SelectKBest(score_func=chi2, k=10)),
    ('classifier', LogisticRegression())
])

In [320]:
final_pipeline.fit(df.drop(columns=['Sentiment']), df['Sentiment'])

Output: the fitted `Pipeline`, displayed with its nested steps and their explicitly set parameters:

- `col_transformer`: `ColumnTransformer(remainder='drop')`
  - `age_imputer`: `SimpleImputer(strategy='mean')` on `['age']`
  - `platform_ohe`: `OneHotEncoder()` on `['Social Media Platform']`
  - `review_processing`: `FeatureUnion` on `'Review'`
    - `word_count`: `FunctionTransformer(func=count_words)`
    - `bag_of_words`: `CountVectorizer()`
- `scaler`: `MaxAbsScaler()`
- `selector`: `SelectKBest(score_func=chi2, k=10)`
- `classifier`: `LogisticRegression()`
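
Once fitted, the pipeline can score unseen reviews in a single call. The sketch below rebuilds the same steps in one self-contained block and predicts on two made-up reviews. The new reviews and the `handle_unknown='ignore'` setting are assumptions added for illustration and do not come from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler, OneHotEncoder

def count_words(reviews):
    return np.array([len(r.split()) for r in reviews]).reshape(-1, 1)

train = pd.DataFrame({
    "Social Media Platform": ["Twitter", "Facebook", "Instagram", "Twitter", "Facebook",
                              "Instagram", "Twitter", "Facebook", "Instagram", "Twitter"],
    "Review": ["Love the new update!", "Too many ads now", "Great for sharing photos",
               "Newsfeed algorithm is biased", "Privacy concerns with latest update",
               "Amazing filters!", "Too much spam", "Easy to connect with friends",
               "Stories feature is fantastic", "Customer support lacking"],
    "age": [21, 19, np.nan, 17, 24, np.nan, 30, 19, 16, 31],
    "Sentiment": [1, 0, 1, 0, 0, 1, 0, 1, 1, 0],
})

review_features = FeatureUnion([
    ("word_count", FunctionTransformer(count_words)),
    ("bag_of_words", CountVectorizer()),
])
columns = ColumnTransformer([
    ("age_imputer", SimpleImputer(strategy="mean"), ["age"]),
    # handle_unknown='ignore' is an added assumption so unseen platforms do not raise
    ("platform_ohe", OneHotEncoder(handle_unknown="ignore"), ["Social Media Platform"]),
    ("review_processing", review_features, "Review"),
], remainder="drop")
pipeline = Pipeline([
    ("col_transformer", columns),
    ("scaler", MaxAbsScaler()),
    ("selector", SelectKBest(score_func=chi2, k=10)),
    ("classifier", LogisticRegression()),
])
pipeline.fit(train.drop(columns=["Sentiment"]), train["Sentiment"])

# Hypothetical unseen reviews; the missing age is filled by the fitted imputer
new_reviews = pd.DataFrame({
    "Social Media Platform": ["Instagram", "Twitter"],
    "Review": ["Fantastic filters", "Too many ads and spam"],
    "age": [22, np.nan],
})
pred = pipeline.predict(new_reviews)
proba = pipeline.predict_proba(new_reviews)
print(pred.shape, proba.shape)  # (2,) (2, 2)
```

With only ten training rows the predicted labels themselves carry little meaning; the point is that imputation, encoding, text vectorization, scaling, and feature selection are all replayed on new data by one `predict()` call.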
