# Building a Simple sklearn-compatible Estimator

In this tutorial, we'll build a sklearn-compatible class for a simple statistical operation: calculating Z-scores. Along the way, we'll cover the object-oriented programming concepts you need to write custom estimators that work within the sklearn ecosystem.

## 1. Introduction to Object-Oriented Programming (OOP)

Object-Oriented Programming is a programming paradigm that organizes code into objects, which are instances of classes. This approach helps in structuring code, making it more modular and easier to maintain.

### 1.1 Classes and Objects

A class is a blueprint for creating objects. It defines attributes (data) and methods (functions) that the objects will have.


```python
class Dog:
    def __init__(self, name, age):
        self.name = name  # This is an instance variable
        self.age = age    # This is also an instance variable

    def bark(self):       # This is a method
        return f"{self.name} says Woof!"

# Creating an object (instance) of the Dog class
my_dog = Dog("Buddy", 3)

# Accessing instance variables
print(my_dog.name)  # Output: Buddy

# Calling a method
print(my_dog.bark())  # Output: Buddy says Woof!
```

```
Buddy
Buddy says Woof!
```



In this example:
- `Dog` is a class.
- `name` and `age` are instance variables (also called member variables or attributes).
- `bark()` is a method.
- `my_dog` is an object (instance) of the `Dog` class.
- `__init__` is a special method called a constructor. It's called when creating a new object.
- `self` refers to the instance of the class. It's used to access instance variables and methods.

### 1.2 Inheritance

Inheritance is a mechanism that allows a class to inherit attributes and methods from another class.


```python
class Animal:
    def __init__(self, species):
        self.species = species

    def make_sound(self):
        return "Some generic animal sound"

class Dog(Animal):  # Dog inherits from Animal
    def __init__(self, name):
        super().__init__("Canine")  # Call the parent class's __init__
        self.name = name

    def make_sound(self):  # This overrides the method from Animal
        return "Woof!"

my_dog = Dog("Buddy")
print(my_dog.species)  # Output: Canine
print(my_dog.make_sound())  # Output: Woof!
```

```
Canine
Woof!
```



In this example:
- `Dog` is a subclass (child class) of `Animal`.
- `Dog` gets its `species` attribute from `Animal.__init__`, which it calls through `super()`.
- `Dog` overrides the `make_sound()` method with its own implementation.
- `super().__init__()` calls the `__init__` method of the parent class.

### 1.3 Multiple Inheritance and Mixins

Python supports multiple inheritance, allowing a class to inherit from multiple parent classes. A Mixin is a class that provides methods to other classes but isn't meant to be instantiated on its own.


```python
class Swimmer:
    def swim(self):
        return "I can swim!"

class Flyer:
    def fly(self):
        return "I can fly!"

class Duck(Animal, Swimmer, Flyer):
    def __init__(self, name):
        super().__init__("Avian")
        self.name = name

my_duck = Duck("Donald")
print(my_duck.swim())  # Output: I can swim!
print(my_duck.fly())   # Output: I can fly!
```

```
I can swim!
I can fly!
```



In this example, `Duck` inherits from `Animal` and also incorporates the `Swimmer` and `Flyer` mixins. This is a simplistic example; in real code, mixins and parent classes carry functionality that child classes inherit for free, so that inheritors can draw on a common pool of behavior instead of rewriting it.
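To make the idea of inheriting functionality "for free" more concrete, here is a minimal sketch. The `DescribeMixin` class, its `describe()` method, and the `Cat` class are hypothetical names introduced only for illustration; the mixin assumes the host class supplies `name` and `species` attributes.

```python
class Animal:
    def __init__(self, species):
        self.species = species


class DescribeMixin:
    """Hypothetical mixin: adds describe() to any class with name and species."""

    def describe(self):
        return f"{self.name} is a {self.species}"


class Duck(DescribeMixin, Animal):  # mixin listed first, base class last
    def __init__(self, name):
        super().__init__("duck")    # resolves to Animal.__init__
        self.name = name


class Cat(DescribeMixin, Animal):
    def __init__(self, name):
        super().__init__("cat")
        self.name = name


print(Duck("Donald").describe())  # Donald is a duck
print(Cat("Tom").describe())      # Tom is a cat

# The method resolution order shows where Python looks for each attribute
print([cls.__name__ for cls in Duck.__mro__])
```

Neither `Duck` nor `Cat` defines `describe()`, yet both can call it. The method resolution order printed at the end is the sequence Python searches when it looks up a method.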

## 2. The Estimator API and Mixins in sklearn

Now that we understand the basics of OOP and inheritance, let's look at how sklearn uses these concepts.

### 2.1 BaseEstimator

`BaseEstimator` is a base class in sklearn that provides common functionality to all estimators:

```python
from sklearn.base import BaseEstimator

class MyEstimator(BaseEstimator):
    def __init__(self, param1=0, param2=1):
        self.param1 = param1
        self.param2 = param2

    def fit(self, X, y=None):
        # Implement the fitting logic here
        return self

    def predict(self, X):
        # Implement the prediction logic here
        pass
```

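`BaseEstimator` provides methods like `get_params()` and `set_params()`, which enable model inspection and hyperparameter tuning. These methods discover parameters from the `__init__` signature, so each constructor argument should be stored unchanged on an attribute of the same name, as in the example above. By extending this class, you can build new components that work with other sklearn components.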
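As a rough illustration of what these inherited methods do, the sketch below defines a hypothetical `ThresholdEstimator` (the class and its `threshold` and `invert` parameters are invented for this example) and calls `get_params()`, `set_params()`, and `sklearn.base.clone()` on it.

```python
from sklearn.base import BaseEstimator, clone


class ThresholdEstimator(BaseEstimator):
    """Hypothetical estimator used only to demonstrate parameter handling."""

    def __init__(self, threshold=0.5, invert=False):
        # Store each constructor argument unchanged under the same name
        self.threshold = threshold
        self.invert = invert

    def fit(self, X, y=None):
        return self


est = ThresholdEstimator(threshold=0.3)
print(est.get_params())          # dict with 'threshold' and 'invert'

est.set_params(threshold=0.8)    # returns the estimator itself
print(est.threshold)

# clone() builds a new, unfitted estimator with the same parameters
est_copy = clone(est)
print(est_copy is est, est_copy.get_params() == est.get_params())
```

Tools such as `GridSearchCV` depend on this behavior: they clone the estimator and call `set_params()` for each candidate setting.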

### 2.2 Mixins in sklearn

sklearn also provides several mixins that add specific functionality to estimators:

- `TransformerMixin`: Adds a `fit_transform()` method
- `ClassifierMixin`: Adds a `score()` method that returns mean accuracy, for classification tasks
- `RegressorMixin`: Adds a `score()` method that returns the R² score, for regression tasks

Here's a simplified view of how `TransformerMixin` works:

```python
class TransformerMixin:
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
```

By inheriting from `TransformerMixin`, a class automatically gets the `fit_transform()` method, which calls `fit()` and then `transform()`.
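The same pattern can be reproduced without sklearn. The sketch below uses two hypothetical classes, `FitTransformMixin` and `Centerer`, to show how a class that only defines `fit()` and `transform()` picks up `fit_transform()` from a mixin.

```python
class FitTransformMixin:
    """Hypothetical stand-in for the simplified TransformerMixin above."""

    def fit_transform(self, X, y=None):
        # fit() returns self, which is what makes this chaining possible
        return self.fit(X, y).transform(X)


class Centerer(FitTransformMixin):
    """Subtracts the mean learned in fit() from every value."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


data = [1.0, 2.0, 3.0, 6.0]
centered = Centerer().fit_transform(data)
print(centered)  # [-2.0, -1.0, 0.0, 3.0]
```

This is also why `fit()` should return `self`: if it returned `None`, the chained `.transform(X)` call would fail.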


## 3. Building a Z-score Estimator with the sklearn API

Before we dive into the code, let's understand what a Z-score is:

A Z-score (also called a standard score) measures how many standard deviations away a data point is from the mean of a dataset. The formula for a Z-score is:

$$Z = (X - \mu) / \sigma$$

Where:
- $X$ is the raw score
- $\mu$ (mu) is the mean of the population
- $\sigma$ is the standard deviation of the population

Z-scores are useful for comparing values from different datasets or distributions.
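As a quick worked example of the formula, the sketch below computes Z-scores for a small list of made-up values using only the standard library. It treats the list as the whole population, so it uses the population standard deviation.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]     # illustrative values, not real data

mu = statistics.fmean(scores)         # population mean: 5.0
sigma = statistics.pstdev(scores)     # population standard deviation: 2.0

# Apply Z = (X - mu) / sigma to every value
z_scores = [(x - mu) / sigma for x in scores]

print(mu, sigma)
print(z_scores)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

A value of 9 sits two standard deviations above the mean, so its Z-score is 2.0; a value equal to the mean has a Z-score of 0.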


Now let's create our Z-score calculator that conforms to the sklearn API:


```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Mixins go first so that BaseEstimator comes last in the method resolution order
class ZScoreTransformer(TransformerMixin, BaseEstimator):
    def __init__(self):
        # No hyperparameters. Learned attributes are created in fit(), not here.
        pass

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.mean_ = np.mean(X, axis=0)  # Trailing underscore indicates estimated attribute
        self.std_ = np.std(X, axis=0)    # Trailing underscore indicates estimated attribute
        return self

    def transform(self, X):
        return (np.asarray(X) - self.mean_) / self.std_

# We don't need to implement fit_transform() because TransformerMixin provides it
```

Let's break this down:

1. Our class inherits from `TransformerMixin` and `BaseEstimator`. sklearn's developer guide recommends listing mixins before `BaseEstimator`.
2. `__init__` is empty because this transformer has no hyperparameters. By sklearn convention, `__init__` only stores constructor arguments; attributes estimated from data, marked by the trailing underscores in `mean_` and `std_`, are created in `fit()`.
3. `fit()` calculates and stores the mean and standard deviation of each column.
4. `transform()` applies the Z-score calculation.
5. We don't need to implement `fit_transform()` because `TransformerMixin` provides it.
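One way to sanity-check a transformer like this is to confirm that the transformed training data has mean 0 and standard deviation 1 in each column, and that the result agrees with sklearn's built-in `StandardScaler`. The sketch below redefines a compact version of the transformer so that it runs on its own; the seeded random data is an assumption made for repeatability.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class ZScoreTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        X = np.asarray(X)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)   # population std (ddof=0), NumPy's default
        return self

    def transform(self, X):
        return (np.asarray(X) - self.mean_) / self.std_


rng = np.random.default_rng(0)      # seeded so the run is repeatable
X = rng.normal(loc=10.0, scale=3.0, size=(200, 2))

Z = ZScoreTransformer().fit_transform(X)  # inherited from TransformerMixin

print(np.allclose(Z.mean(axis=0), 0.0))   # columns are centered
print(np.allclose(Z.std(axis=0), 1.0))    # columns have unit spread
print(np.allclose(Z, StandardScaler().fit_transform(X)))
```

The last check relies on `StandardScaler` also using the population standard deviation, which matches NumPy's default `ddof=0`.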

## 4. Using Our Custom Estimator

Here's how we can use our custom estimator:


```python
import numpy as np
from sklearn.model_selection import train_test_split

# Generate some random data
X = np.random.randn(100, 2)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Create and fit our transformer
z_transformer = ZScoreTransformer()
z_transformer.fit(X_train)

# Transform the test data
X_test_transformed = z_transformer.transform(X_test)

print("Original data (first 5 rows):")
print(X_test[:5])
print("\nTransformed data (first 5 rows):")
print(X_test_transformed[:5])
```

```
Original data (first 5 rows):
[[ 0.56699586  0.77514674]
 [-1.17304336 -0.41577616]
 [ 0.46947063 -0.92225671]
 [ 0.70355083  0.13036693]
 [ 0.09095967 -0.62725398]]

Transformed data (first 5 rows):
[[ 0.6768307   0.78690639]
 [-1.27054608 -0.49347496]
 [ 0.56768468 -1.03800077]
 [ 0.82965712  0.09369271]
 [ 0.14407155 -0.72083835]]
```



## 5. Conclusion

By understanding OOP concepts like classes, inheritance, and mixins, we can create custom estimators that integrate seamlessly with sklearn's ecosystem. This allows us to extend sklearn's functionality while maintaining consistency with its API.

## 6. Exercise for Students

As an exercise, try to:
1. Add a parameter to `__init__` to optionally use median and interquartile range instead of mean and standard deviation.
2. Implement error handling for division by zero (when standard deviation is zero).
3. Add an `inverse_transform()` method that maps Z-scores back to the original scale.

Happy coding!



## 2. Basic Python Classes

Before we create our estimator, let's quickly review Python classes:


```python
class Dog:
    def __init__(self, name):
        self.name = name

    def bark(self):
        return f"{self.name} says Woof!"

my_dog = Dog("Buddy")
print(my_dog.bark())  # Output: Buddy says Woof!
```

```
Buddy says Woof!
```




In this example:
- `__init__` is a special method (constructor) that initializes the object.
- `self` refers to the instance of the class.
- Methods are functions defined inside the class.

## 3. The Estimator API

sklearn's Estimator API defines a common interface for all machine learning algorithms. For a transformer like ours, the main methods we need to implement are:

- `fit(X, y=None)`: Calculates and stores the parameters needed for the transformation.
- `transform(X)`: Applies the transformation to the input data.

Methods we can usually inherit rather than write ourselves include:
- `fit_transform(X, y=None)`: Fits the estimator and then transforms the data.
- `get_params()`: Returns the estimator's parameters.
- `set_params(**params)`: Sets the parameters of the estimator.

## 4. Building Our Z-score Estimator

Let's create a simple Z-score calculator that conforms to the sklearn API:


```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# The names in parentheses, TransformerMixin and BaseEstimator, are the parent
# (base) classes that ZScoreTransformer inherits from.
class ZScoreTransformer(TransformerMixin, BaseEstimator):
    def __init__(self):
        # No hyperparameters; mean_ and std_ are created in fit()
        pass

    def fit(self, X, y=None):
        X = np.asarray(X)
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        return self

    def transform(self, X):
        return (np.asarray(X) - self.mean_) / self.std_

    def fit_transform(self, X, y=None):
        # Shown for illustration; TransformerMixin already provides an equivalent method
        return self.fit(X, y).transform(X)
```

Let's break down the key components:

1. We inherit from `TransformerMixin` and `BaseEstimator`.
2. `__init__` is empty because this transformer has no hyperparameters; the learned attributes `mean_` and `std_` are created in `fit`.
3. `fit` calculates and stores the mean and standard deviation.
4. `transform` applies the Z-score calculation.
5. `fit_transform` combines fitting and transforming in one step. `TransformerMixin` already provides this method, so defining it here is optional.

## 5. Understanding BaseEstimator and TransformerMixin

### BaseEstimator

`BaseEstimator` provides default `get_params()` and `set_params()` methods. These are useful for model inspection and hyperparameter tuning. While not strictly necessary, inheriting from `BaseEstimator` makes our estimator more compatible with sklearn's ecosystem.

If we didn't extend `BaseEstimator`, we'd need to implement these methods ourselves:


```python
class ZScoreTransformerWithoutBase:
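    def __init__(self):
        # No hyperparameters; mean_ and std_ are learned in fit()
        pass

    def get_params(self, deep=True):
        # Only constructor arguments count as parameters, and we have none.
        # Fitted attributes such as mean_ and std_ do not belong here.
        return {}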

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

    # ... rest of the methods ...
```

### TransformerMixin

`TransformerMixin` provides a default `fit_transform()` method that calls `fit()` and then `transform()`. It's a convenience class that saves us from writing this method ourselves.

Other common mixins in sklearn include:
- `RegressorMixin`: For regression algorithms
- `ClassifierMixin`: For classification algorithms
- `ClusterMixin`: For clustering algorithms

Each mixin provides some default behavior appropriate for its type of estimator.
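As one illustration of this default behavior, the sketch below builds a toy classifier and calls the `score()` method it inherits from `ClassifierMixin`. The `MajorityClassifier` class and the tiny dataset are hypothetical and exist only for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical classifier: always predicts the most frequent training label."""

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.classes_ = labels
        self.majority_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


X = np.zeros((5, 1))            # features are ignored by this toy model
y = np.array([1, 1, 1, 0, 0])

clf = MajorityClassifier().fit(X, y)
print(clf.predict(X))           # [1 1 1 1 1]
print(clf.score(X, y))          # 0.6, mean accuracy from ClassifierMixin
```

The class never defines `score()`; the mixin implements it in terms of `predict()`, in the same way that `TransformerMixin` implements `fit_transform()` in terms of `fit()` and `transform()`.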

## 6. Using Our Custom Estimator

Now let's see how to use our custom estimator:



```python
import numpy as np
from sklearn.model_selection import train_test_split

# Generate some random data
X = np.random.randn(100, 2)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Create and fit our transformer
z_transformer = ZScoreTransformer()
z_transformer.fit(X_train)

# Transform the test data
X_test_transformed = z_transformer.transform(X_test)

print("Original data (first 5 rows):")
print(X_test[:5])
print("\nTransformed data (first 5 rows):")
print(X_test_transformed[:5])
```

```
Original data (first 5 rows):
[[ 0.45820894 -1.53318002]
 [-0.37224735  0.62818689]
 [ 0.89868645  1.57903707]
 [ 0.91057867  0.23473719]
 [ 1.31144527  0.75723325]]

Transformed data (first 5 rows):
[[ 0.59634143 -1.71502361]
 [-0.29779675  0.4750514 ]
 [ 1.07059608  1.43853115]
 [ 1.08340023  0.07637576]
 [ 1.51500649  0.6058118 ]]
```



This example demonstrates how to:
1. Create an instance of our custom transformer.
2. Fit the transformer to training data.
3. Transform test data.

## 7. Benefits of Conforming to the Estimator API

By conforming to sklearn's Estimator API, we gain several advantages:

1. **Consistency**: Our estimator works like any other sklearn estimator, making it easier for others (and ourselves) to use.
2. **Pipeline compatibility**: We can use our estimator in sklearn's Pipeline and FeatureUnion.
3. **Model selection tools**: We can use tools like GridSearchCV for hyperparameter tuning (if our estimator had hyperparameters).
4. **Validation tools**: We can use sklearn's cross-validation tools seamlessly.
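The pipeline and cross-validation points above can be sketched as follows. This is a minimal example under stated assumptions: it redefines a compact `ZScoreTransformer` so the snippet is self-contained, and the synthetic dataset, the `LogisticRegression` model, and the step names `"zscore"` and `"clf"` are choices made for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline


class ZScoreTransformer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        X = np.asarray(X)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        return (np.asarray(X) - self.mean_) / self.std_


# Synthetic classification data, seeded for repeatability
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("zscore", ZScoreTransformer()),   # our custom step
    ("clf", LogisticRegression()),     # any sklearn estimator can follow
])

# Each fold clones the pipeline, fits it on the training split,
# and scores it on the held-out split
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.shape, scores.mean())
```

Because the transformer is refit inside every fold, the mean and standard deviation are learned from the training split only, which avoids leaking information from the held-out data.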

## 8. Conclusion

You've now created a simple sklearn-compatible Z-score transformer! This approach allows you to integrate custom transformations seamlessly with sklearn's ecosystem.

## 9. Exercise for Students

As an exercise, try to extend this class to include:
1. A parameter in `__init__` to optionally use median and interquartile range instead of mean and standard deviation.
2. Error handling for division by zero (when standard deviation is zero).
3. An `inverse_transform()` method that maps Z-scores back to the original scale.

Happy coding!