In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import geopandas as gpd

# Transformers and Geopandas

<b>Note:</b> the sklearn transformer covered here, the neural network transformer we'll mention later, and probably several other things named 'transformer' are all different things.

## What is an SKLearn Transformer?

Sklearn transformers are classes that implement the `fit` and `transform` methods. They are used to preprocess data before feeding it to a model. In plain English, they are steps in a data preparation pipeline, each applying some type of clean-up to the data. We use existing transformers all the time to impute, scale, or encode data.
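As a quick illustration of the built-in transformers mentioned above, here is a minimal sketch that imputes and then scales a single column. The toy numbers are made up for the example; `SimpleImputer` and `StandardScaler` are the standard sklearn classes.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): one numeric feature with a missing value
X = np.array([[1.0], [2.0], [np.nan], [3.0]])

# fit learns the column mean; transform fills the gap with it
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# fit learns the mean and standard deviation; transform rescales the column
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
```

Both objects follow the same `fit` then `transform` pattern, which is what lets them be chained together.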

We can also write our own custom transformers by extending the `BaseEstimator` and `TransformerMixin` classes from the `sklearn.base` module. This is useful when we need to apply some custom transformation that is not available in the existing transformers. As long as we provide the needed functionality, our custom transformer can be used in the same way as the built-in transformers.

## New - Pandas Got Better

Fairly recently (scikit-learn 1.2, released at the end of 2022), the usability of the pipeline transformers was improved a bit: through the `set_output` API, transformers can now return pandas dataframes, not only arrays. Dataframes are much easier for a human to work with, so this makes using and tailoring transformer steps easier. In the examples later in Machine Learning, we assume we need an array, so we make arrays before starting the pipeline. We now have more flexibility to keep the data in dataframe format longer, so we can handle it through the pipeline process with more ease.

In [None]:
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, X, y=None):
        # fit is for any 'configuration' that must be learned from the data before transforming.
        # For example, when scaling data with standardization, we'd need the mean and std of the data - that's calculated here.
        # This transformer has nothing to learn, so fit just returns self.
        return self

    def transform(self, X):
        # Your transformation logic goes here
        X_transformed = X.copy()  # Copy the input DataFrame to avoid modifying the original
        X_transformed[self.column_name] = X_transformed[self.column_name] * 2  # Example transformation
        return X_transformed

## What Can Transformers Do?

Pretty much anything you want, some common examples include:
<ul>
<li> Imputing missing values </li>
<li> Scaling numerical features </li>
<li> Encoding categorical features </li>
<li> Extracting features from text </li>
<li> Reducing dimensionality </li>
</ul>

Basically anything we can express in a statement of "change the data in this way" can be a transformer.

### And How Do They Do It?

A transformer is a pretty simple concept. It has two main methods:
<ul>
<li> `fit` - This method is used to learn the parameters of the transformation. For example, if we are scaling numerical features, the `fit` method will calculate the mean and standard deviation of each feature.
    <ul>
    <li> In many cases, the `fit` method does nothing except return `self`, but it is still required to be there. </li>
    </ul>
</li>
<li> `transform` - This method is used to apply the transformation to the data. For example, if we are scaling numerical features, the `transform` method will subtract the mean and divide by the standard deviation. </li>
</ul>
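To make the split between `fit` and `transform` concrete, here is a small sketch of a transformer whose `fit` actually learns something. The class name `MeanCenterer` and the `price` column are hypothetical, invented for this example.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    """Hypothetical example: subtract the training mean from one column."""

    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, X, y=None):
        # Learn the parameter from the training data.
        # The trailing underscore is the sklearn convention for fitted attributes.
        self.mean_ = X[self.column_name].mean()
        return self

    def transform(self, X):
        # Apply the learned parameter to whatever data we are given
        X_transformed = X.copy()
        X_transformed[self.column_name] = X_transformed[self.column_name] - self.mean_
        return X_transformed

train = pd.DataFrame({"price": [10.0, 20.0, 30.0]})
new = pd.DataFrame({"price": [25.0]})

centerer = MeanCenterer("price").fit(train)  # learns the mean of the training prices
result = centerer.transform(new)             # reuses that mean on new data
```

The mean comes from the training data only, so new data is shifted by the same amount no matter what values it contains.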

Each of these methods is automatically called by the sklearn pipeline when we insert the transformer as a step. If we have any configuration, such as the number of features to extract or the method of scaling, we can pass it as arguments to the transformer's `__init__` method.

## Adding Transformer to a Pipeline

Adding a custom transformer to a pipeline is identical to using a premade one. That's one of the big benefits of the interchangeability of objects we get with python, duck typing, and inheritance: we only need to provide the tiny portion of the functionality that differs, not learn and reimplement the entire thing!

We must conform strictly to the input and output format, as well as the required methods, but beyond that we can do whatever we want.
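Here is a rough sketch of a custom transformer sitting next to a built-in one in a `Pipeline`. The `ColumnDoubler` class and the column names are hypothetical stand-ins; any class with the right `fit` and `transform` methods would slot in the same way.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class ColumnDoubler(BaseEstimator, TransformerMixin):
    """Hypothetical custom step: doubles the values in one column."""

    def __init__(self, column_name):
        self.column_name = column_name

    def fit(self, X, y=None):
        return self  # Nothing to learn

    def transform(self, X):
        X_transformed = X.copy()
        X_transformed[self.column_name] = X_transformed[self.column_name] * 2
        return X_transformed

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# The custom step is listed exactly like the built-in one
pipe = Pipeline([
    ("double_a", ColumnDoubler("a")),
    ("scale", StandardScaler()),
])

# The pipeline calls fit and transform on each step in order
output = pipe.fit_transform(df)
```

By default the final `StandardScaler` step returns a NumPy array, so `output` here is an array with one row per input row.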

## Geopandas and Pipelines

Geopandas is built "on top of" pandas, adding the ability to handle geospatial data to the already powerful pandas data manipulation library. We can use geopandas dataframes exactly as we would use a regular pandas dataframe, but with the added ability to handle geospatial data. Since geopandas dataframes are just pandas dataframes with some extra functionality, we can use them in sklearn pipelines as well.
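A small sketch of what "built on top of pandas" looks like in practice. The column names and points are made up for illustration.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Toy GeoDataFrame (values are made up for illustration)
gdf = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"], "value": [1, 5, 10]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2)],
)

# A GeoDataFrame is a subclass of the pandas DataFrame
is_dataframe = isinstance(gdf, pd.DataFrame)

# Ordinary pandas operations work, and filtering keeps the geo type
filtered = gdf[gdf["value"] > 2]
total = gdf["value"].sum()
```

One caveat worth keeping in mind: the geometry column is not numeric, so steps that convert everything to arrays (a scaler, for example) will generally need it dropped or converted first.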

## Example - Spatial Join Transformer

In this example, we can use a spatial join to connect two datasets based on their location. Specifically, we can use the point position of items in our data to connect with the area polygons in a spatial dataset, and get the neighborhood label from that spatial dataset. To make this work, we'll need a few parts:
<ul>
<li> A geopandas dataframe with the spatial data. This is like a setting or configuration step, so it will be in the constructor. </li>
<li> A transformer that can take the 'regular' data we're using, perform the spatial join, and add the result in a new column. </li>
</ul>

As long as our input and outputs match the sklearn transformer format, we can use this in a pipeline just like any other transformer.
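Before wrapping it in a transformer, it may help to see the spatial join on its own. This sketch uses two made-up square "neighborhoods" and three made-up points; the `neighborhood` and `item` column names are assumptions for the example. It uses the `predicate` argument of `gpd.sjoin`, which is available in reasonably recent geopandas versions.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Two unit squares side by side, standing in for neighborhood polygons
neighborhoods = gpd.GeoDataFrame(
    {"neighborhood": ["A", "B"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
)

# Three points: one in each square, one outside both
points = gpd.GeoDataFrame(
    {"item": ["x", "y", "z"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
)

# Left join keeps every point; points outside all polygons get a missing label
joined = gpd.sjoin(points, neighborhoods, how="left", predicate="intersects")
```

With real data, both dataframes should share the same coordinate reference system before joining; the toy data here has none set on either side.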

In [None]:
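class Spatial_Joiner(BaseEstimator, TransformerMixin):
    def __init__(self, gdf, column_name):
        self.gdf = gdf  # GeoDataFrame of area polygons (the 'configuration')
        self.column_name = column_name  # Assumed here to be the column in gdf holding the label to attach

    def fit(self, X, y=None):
        return self  # Nothing to learn from the data

    def transform(self, X):
        # X must be a GeoDataFrame of points in the same CRS as self.gdf
        areas = self.gdf[[self.column_name, self.gdf.geometry.name]]  # Keep only the label and the geometry
        # 'predicate' replaces the older 'op' argument, which newer geopandas versions no longer accept
        X_transformed = gpd.sjoin(X, areas, how="left", predicate="intersects")
        # Drop the helper index column that sjoin adds
        return X_transformed.drop(columns="index_right", errors="ignore")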

## Exercise - Add Distance in a Transformer

For this exercise, create a transformer that adds a new column representing the distance between the two locations in the dataframe. The transformer should take the two columns containing the latitude and longitude of the two locations as input and add a new column with the distance between them.
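As a hint for the distance calculation itself (not the transformer), one common choice for latitude and longitude pairs is the haversine great-circle formula. The sketch below assumes that is the kind of distance wanted and uses a mean Earth radius of 6371 km; the function name `haversine_km` is made up for this example.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in decimal degrees.

    Assumes a spherical Earth with the given radius (6371 km is a common mean value).
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# A quarter of the way around the equator
quarter = haversine_km(0.0, 0.0, 0.0, 90.0)
```

Inside a transformer, a function like this could be applied row by row, or rewritten with numpy to work on whole columns at once.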

In [None]:
class Add_Distance_Transformer( BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name
    def fit(self, X, y=None):
        return self  
    def transform(self, X):
        X_transformed = X.copy()  # Copy the input DataFrame to avoid modifying the original
        
        return X_transformed

## Geospatial Transformations

With geospatial data, particularly for the things you're likely to be doing, these transformers and pipelines can be used to build tools that automatically process the data into a format that provides what you need for analysis. We can create a pipeline that takes in raw data and outputs it in some format that we know we want - the steps that do those transformations are the transformers in the pipeline. Once built, we can process any new data with no additional effort by just running it through the pipeline. This means we'd never manually manipulate data in Excel or something like that; we'd always use the pipeline to do it for us.

For your applications, you'll get some data and the format you need it in might vary, or you might need more than one format. For example, if you were displaying some data in Tableau or Power BI, you can take some raw data, run it through a pipeline that calculates whatever values need to be displayed, then output that as a datasource for your visualization. The transformer steps might do all kinds of stuff like calculate distance, do spatial joins to get region labels, calculate area, etc. This data can then be fed to the visualization tool, and it can make pretty pictures without having to manipulate data there. There can be multiple pipelines (or outputs) that each prepare one central source of data, automatically, for a different purpose.
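A rough sketch of that idea: a pipeline that takes raw data, calculates a display value, trims to the columns a dashboard needs, and produces CSV text that a visualization tool could read. The column names, the numbers, and the two helper functions are all made up for illustration; `FunctionTransformer` is just a convenient way to turn a plain function into a pipeline step.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_density(df):
    # Hypothetical calculated value for the dashboard
    out = df.copy()
    out["density"] = out["population"] / out["area_km2"]
    return out

def keep_display_columns(df):
    # Keep only what the visualization needs
    return df[["region", "density"]]

# Toy raw data (made up for illustration)
raw = pd.DataFrame({
    "region": ["north", "south"],
    "population": [1000, 500],
    "area_km2": [10.0, 25.0],
})

prep = Pipeline([
    ("density", FunctionTransformer(add_density)),
    ("select", FunctionTransformer(keep_display_columns)),
])

datasource = prep.fit_transform(raw)
csv_text = datasource.to_csv(index=False)  # could be written to a file for Tableau or Power BI
```

When new raw data arrives, running it through `prep.transform` repeats exactly the same steps with no manual work.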

If this sounds similar to some ETL stuff you talked about in the database classes, that's because it is. 