In [None]:
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd
import geopandas as gpd
import numpy as np

# Transformers and Geopandas

<b>Note:</b> the sklearn transformer, the neural network transformer we'll mention later, and probably several other things named 'transformer' are all different things. 

## What is an SKLearn Transformer?

SKlearn transformers are classes that implement the `fit` and `transform` methods. They are used to preprocess data before feeding it to a model. In normal English, they are a step in data preparation pipelines that apply some type of clean-up step to the data. We use existing transformers all the time to impute, scale, or encode data.

We can also write our own custom transformers by extending the `BaseEstimator` and `TransformerMixin` classes from the `sklearn.base` module. This is useful when we need to apply some custom transformation that is not available in the existing transformers. As long as we provide the needed functionality, our custom transformer can be used in the same way as the built-in transformers.

## New - Pandas Got Better

As of scikit-learn 1.2 (released in late 2022), the usability of the pipeline transformers was improved a bit by allowing them to return pandas dataframes, not only arrays. Dataframes are much easier to use as a human, so this makes using and tailoring transformers easier. In the examples later in Machine Learning, we assume we need an array, so we make arrays before starting the pipeline. We now have more flexibility to keep the data in dataframe format longer, so we can handle it through the pipeline process with more ease. 

## What's a Pipeline?

A pipeline in sklearn is a sequence of steps that are applied to the data. In data science, we often need to load a large amount of data, and apply processing steps to that data in bulk to do things like impute, scale, or encode it. The pipeline is a way to automate this process so we can treat it as a group of steps, not a whole bunch of individual actions we need to manage bit by bit. 

In most cases, we will load our data, send it through the pipeline, and the output of that will go to the predictive modelling algorithm. All the preprocessing steps are done in the pipeline, so we can just focus on the model.

### Simple Pipeline Example

Here's a simple example of a pipeline - any data fed through this pipeline will have two steps applied to it - first the imputation (filling in blanks), then the scaling (making sure all the data is on the same scale). Each of these steps is a transformer - an object that we can make by extending some classes. These steps can be literally anything that we can imagine, as long as we meet the expectations of what a transformer needs (fit and transform methods) and we accept and return the data in the correct format. 

In [None]:
from sklearn import pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler

pipe = pipeline.Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])


We can also use the `make_pipeline` function to make a pipeline. It really doesn't matter which one you use; I personally never use `make_pipeline`, but it's there if you want it. With `Pipeline` we choose the step names ourselves, which I find a bit better; `make_pipeline` generates the names automatically from the class names.
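
A quick sketch of the naming difference between the two styles (the variable names here are just for illustration):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit style: we pick the step names
named = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Shortcut style: names are generated from the lowercased class names
auto_named = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

named_step_names = [name for name, _ in named.steps]
auto_step_names = [name for name, _ in auto_named.steps]
print(named_step_names)
print(auto_step_names)
```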

In [None]:
from sklearn.pipeline import make_pipeline

pipe_for_fake_chumps = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler()
)

#### Indentation

This is something of a personal preference, but I find these types of things far easier to read when the items are lined up as they are above. VS Code will usually do this pretty well by default, and it really is much more readable than listing them all on one line. 

In [None]:
df = pd.read_csv("../data/titanic_train.csv")
cat_cols = df.select_dtypes(include=['object']).columns
df.drop(cat_cols, axis=1, inplace=True)
df

In [None]:
pd_after = pipe.fit_transform(df)
pd.DataFrame(pd_after, columns=df.columns)

#### Results

What happened here? The original data went through the two steps, each of which applied some transformation that changed the data:
<ul>
<li> The imputer found each empty value, and replaced it with the mean of the column. </li>
<li> The scaler did a two step scaling process: </li>
    <ul>
    <li> The 'fit' part allowed the scaler to learn the mean and standard deviation of the data. </li>
    <li> The 'transform' part applied the scaling to the data - setting the values to be "standardized" - mean of 0 and standard deviation of 1. </li>
    <li> <b> The details of this should be covered in stats, we're just worried about the step here. </b> </li>
    </ul>
</ul>
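
To make the fit and transform parts concrete, here is a small sketch on a single made-up column. It runs the two steps by hand and prints what each one learned during `fit`; the attribute names (`statistics_`, `mean_`, `scale_`) are the ones sklearn uses for these two transformers.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# One made-up column with a gap: the known values are 1, 2 and 6
col = np.array([[1.0], [2.0], [np.nan], [6.0]])

imputer = SimpleImputer(strategy="mean")
filled = imputer.fit_transform(col)   # the gap becomes the mean of 1, 2 and 6

scaler = StandardScaler()
scaled = scaler.fit_transform(filled)

print(imputer.statistics_)            # value learned by the imputer's fit
print(scaler.mean_, scaler.scale_)    # values learned by the scaler's fit
print(scaled.mean(axis=0), scaled.std(axis=0))
```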

### Pipeline Details

The pipeline is a pretty simple concept at its core - it is a series of steps where data goes in one end, each step does whatever it does, and the result comes out the end. Each pipeline step has two parts defined:
<ul>
<li> Name - this is how we refer to the step. When making more complex pipelines, we may need this. We can also grab a step by name. </li>
<li> Action - this is the transformer that is applied to the data. Any configuration goes in the constructor call here. </li>
</ul>
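
A short sketch of why the names matter, using a throwaway pipeline (`demo_pipe` is just an illustrative name): we can pull a step out by name, and we can change a step's configuration through the pipeline using the `<step name>__<parameter>` pattern.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Grab a step by its name (two equivalent spellings)
imputer_step = demo_pipe.named_steps["imputer"]
same_step = demo_pipe["imputer"]

# Change a step's configuration through the pipeline
demo_pipe.set_params(imputer__strategy="median")
print(demo_pipe.named_steps["imputer"].strategy)
```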

Once created, it does those steps to that data in that order. This is really useful in a real life scenario where we likely have lots of data either in an incoming flow or regular batches to be processed. If we make a pipeline, then all the processing is handled there - we can capture the data in its original format and trust that our pipeline will do the right thing to it.

<b>Note:</b> there are ways to have a smarter pipeline that can do things like process part of the data in one way and part in another (like discrete and continuous data) - that uses the same pipeline framework, but with a couple of slightly more complex tools, such as sklearn's `ColumnTransformer`. 
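
As a rough sketch of that idea, here is one way it might look with sklearn's `ColumnTransformer`, on a made-up dataframe with one numeric and one categorical column. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one numeric and one categorical column
mixed = pd.DataFrame({
    "fare": [7.25, np.nan, 53.1, 8.05],
    "port": ["S", "C", "S", "Q"],
})

# The numeric column gets its own mini pipeline
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Each entry is (name, transformer, columns it applies to)
splitter = ColumnTransformer([
    ("numeric", numeric_steps, ["fare"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["port"]),
])

mixed_out = splitter.fit_transform(mixed)
print(mixed_out.shape)
```

The result has one scaled numeric column plus one column per category in `port`.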

In [None]:
pipe.named_steps

## Exercise - Make a Simple Pipeline

Make a pipeline for the titanic data. Try to have it do the following things:
<ul>
<li> Fill in missing values with the median value of that column. (Impute) </li>
<li> Scale the data so all the columns are on the same scale. (Use Min-max scaler) </li>
</ul>

If you're feeling ok here, try to incorporate some other steps. Most work with little to no modification to the code you need to write. If you search for "sklearn pipeline", the documentation page has a link to a User Guide, which is a pretty good article with examples; there are some different transformers in there that you can try out. You can also just google "sklearn pipeline transformers" and find one to try. For the most part, these are all related to preparing data for ML modeling, so the details aren't something we have worried about yet, but the act of using the transformers is identical no matter what they actually do. 

In [None]:
# Do it. 

### Why Do I Care?

For the moment, we don't need to care about the details of those steps, but we do want to use those mechanics to do our work. The steps that the pipeline does can be anything we want, so if we have some spatial cleanup, we can make a transformer to do that, and add it to the pipeline!

In machine learning later on, we will use variations of these pipelines to load most of the data we use. It is even more seamless there: we load the data, then the pipeline feeds the output directly into the modelling algorithm, and the wall-of-numbers version above is hidden from us.

### Pipeline Creation Options

There are a few ways to make a pipeline, and in general pipeline components can be connected together and swapped out as needed. This leans heavily on the duck typing pattern of thought we've been getting used to - we can swap things around, as long as it does what is required of it. 

## What Can Transformers Do?

Pretty much anything you want, some common examples include:
<ul>
<li> Imputing missing values </li>
<li> Scaling numerical features </li>
<li> Encoding categorical features </li>
<li> Extracting features from text </li>
<li> Reducing dimensionality </li>
</ul>

Basically anything we can express in a statement of "change the data in this way" can be a transformer.

### And How Do They Do It?

A transformer is a pretty simple concept. It has two main methods:
<ul>
<li> `fit` - This method is used to learn the parameters of the transformation. For example, if we are scaling numerical features, the `fit` method will calculate the mean and standard deviation of each feature. </li>
    <ul>
    <li> In many cases, the `fit` method does nothing, it just returns 'self', but it is still required to be there. </li>
    <li> If we were imputing and inserting 'median' for missing values, this step would calculate the median of each feature. </li>
    </ul>
<li> `transform` - This method is used to apply the transformation to the data. For example, if we are scaling numerical features, the `transform` method will subtract the mean and divide by the standard deviation. </li>
</ul>

Each of these methods is automatically called by the sklearn pipeline when we insert them as steps. If we have any configuration, such as the number of features to extract or the method of scaling, we can pass these as arguments to the transformer's `__init__` method.
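
Here is a small sketch that puts the three pieces together: configuration in `__init__`, learning in `fit`, and applying in `transform`. `RangeClipper` is a hypothetical transformer made up for this illustration (it is not part of sklearn); it learns per-column quantile limits and clips new data to them.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RangeClipper(BaseEstimator, TransformerMixin):
    """Hypothetical example: clip each column to quantiles learned in fit."""
    def __init__(self, lower=0.05, upper=0.95):
        # Configuration only: store the arguments, do no work here
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn the per-column limits from the data we are fitted on
        X = pd.DataFrame(X)
        self.low_ = X.quantile(self.lower)
        self.high_ = X.quantile(self.upper)
        return self

    def transform(self, X):
        # Apply the limits learned in fit to whatever data arrives
        X = pd.DataFrame(X)
        return X.clip(lower=self.low_, upper=self.high_, axis=1)

train = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0, 100.0]})
clipper = RangeClipper(lower=0.0, upper=0.75).fit(train)

# New data is clipped using the limits learned from `train`
clipped = clipper.transform(pd.DataFrame({"value": [-5.0, 3.0, 500.0]}))
print(clipped)
```

The point to notice is that `transform` never recalculates the limits; it reuses whatever `fit` stored, which is what lets us fit on one batch of data and apply the same treatment to the next.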

### Making a Customized Transformer

As long as we follow these expectations, we can make a pipeline step doing almost anything we want. We need to accept values in an expected format - a 2D array or dataframe, and return the values in that same correct format - the details can vary though, we can move, add, delete or change values, rows, and columns as we need. Each manipulation in the transform steps of transformers is basically a prescribed set of steps we could do to any spreadsheet of data. 

<b>Note:</b> the `X.copy()` call in the template below isn't required; it is there as a safety measure to prevent accidental side effects. It's a generally good (or more accurately, safe) idea, but not a requirement. There's at least one solution below without it. The biggest risk would be doing something that changes the data in place, and then trying to use the original data later - for example, if your transformer sorts, you don't want to accidentally sort the original data and then use it later. In normal usage in a pipeline, this doesn't usually come up, but it is possible, and can be very confusing when it does. 
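
A small sketch of that side effect, using two hypothetical transformers that do the same thing, one editing its input and one working on a copy:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class InPlaceDoubler(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that (unsafely) edits its input."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X["value"] = X["value"] * 2   # writes into the caller's dataframe
        return X

class CopyingDoubler(BaseEstimator, TransformerMixin):
    """Same idea, but works on a copy."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_transformed = X.copy()
        X_transformed["value"] = X_transformed["value"] * 2
        return X_transformed

original_a = pd.DataFrame({"value": [1, 2, 3]})
original_b = pd.DataFrame({"value": [1, 2, 3]})

InPlaceDoubler().fit_transform(original_a)
CopyingDoubler().fit_transform(original_b)

print(original_a["value"].tolist())  # changed behind our back
print(original_b["value"].tolist())  # untouched
```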

### Transformer TL;DR

So, we can make a transformer to do anything we want, we just need to follow some constraints:
<ul>
<li> We need to extend the BaseEstimator and TransformerMixin classes. </li>
<li> We need to implement the fit and transform methods. </li>
<li> We need to accept and return the data in the correct format - a 2D table in either DF or array. </li>
</ul>

In [None]:
class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        # The fit method typically does nothing for transformers.
        # It is mainly used when there is a 'configuration'
        # step that needs to be done before the transformation.
        # For example, when scaling data with standardization,
        # we'd need the mean and std of the data - that's calculated here.
        return self
    def transform(self, X):
        # Your transformation logic goes here
        X_transformed = X.copy()  # Copy the input DataFrame to avoid modifying the original
        ############## Your code here ##############
        return X_transformed

### Using Custom Transformers

Adding a custom transformer to a pipeline is identical to using a premade one - that's one of the big benefits of the interchangeability of objects we have with python, duck typing, and inheritance - we only need to provide a tiny portion of the functionality that differs, not learn and reimplement the entire thing!

We must conform strictly to the input and output format, as well as the required methods, but beyond that we can do whatever we want. For this example below, we are adding two steps - the area calculator and the custom one. Each step gets a name to refer to it from here on out, and each init call can contain any arguments that we need or want to provide for configuration. 

## Exercise - Make a Custom Transformer

Make a custom transformer that adds a column to the data. The column should be the product of x and y.

If that works, add an optional parameter that allows the user to add a constant that defaults to 1, this constant should be multiplied by the product of x and y.

If all that works, add the ability to specify the column names for x and y, and name the output column. 

<b>Use your transformer by calling fit_transform, as well as in a pipeline.</b>

In [None]:
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, factor=0.5, noise=0.1)
df_xy = pd.DataFrame(X, columns=["x", "y"])
df_xy.head()

In [None]:
# Make class

In [None]:
# Use it

In [None]:
# Pipe it

## Geopandas and Pipelines

Geopandas is built "on top of" pandas, adding the ability to handle geospatial data to the already powerful pandas data manipulation library. We can use geopandas dataframes exactly as we would use a regular pandas dataframe, but with the added ability to handle geospatial data. Since geopandas dataframes are just pandas dataframes with some extra functionality, we can use them in sklearn pipelines as well.

There's little, if any, difference in when we make data a geodataframe. As a rough rule of thumb, if there's lots of manipulation to do with data, I'll load it as a pandas dataframe and make a geodataframe with that data; if there is not much processing, I'll read it with GeoPandas directly and set the geometry right up front. 
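
A minimal sketch of the "pandas first, geodataframe second" route, using made-up coordinates. The EPSG:4326 label is an assumption that the columns hold ordinary longitude/latitude degrees.

```python
import geopandas as gpd
import pandas as pd

# Made-up coordinates (illustrative only)
plain = pd.DataFrame({
    "name": ["a", "b"],
    "Longitude": [-113.50, -113.49],
    "Latitude": [53.57, 53.54],
})

# x is longitude and y is latitude, in that order
points = gpd.points_from_xy(plain["Longitude"], plain["Latitude"])

# EPSG:4326 is the usual label for plain lon/lat degrees (assumed here)
geo = gpd.GeoDataFrame(plain, geometry=points, crs="EPSG:4326")
print(geo.geometry.x.tolist())
```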

In [None]:
gpd_sample = gpd.read_file( "../data/03_city_edmonton_assessment_sample.csv", geometry="geometry")
gpd_sample.head()

In [None]:
pandas_sample = pd.read_csv("../data/03_city_edmonton_assessment_sample.csv")
# points_from_xy expects x (longitude) first, then y (latitude)
pandas_sample = gpd.GeoDataFrame(
    pandas_sample,
    geometry=gpd.points_from_xy(pandas_sample["Longitude"], pandas_sample["Latitude"])
)
pandas_sample.head()

### Realistic(-ish) Examples

Here are a couple of simple examples. The first adds an area column, the second generates points from lat/lon and places that into the geometry column. 

In [None]:
class areaGenerator(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Area of each shape comes from the geometry column (no copy here, so X is edited in place)
        X["area"] = X.geometry.area
        return X

In [None]:
class PointGenerator(BaseEstimator, TransformerMixin):
    def __init__(self, lon, lat):
        self.lon = lon
        self.lat = lat
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_transformed = X.copy()
        X_transformed["geometry"] = gpd.points_from_xy(X_transformed[self.lon], X_transformed[self.lat])
        return X_transformed

## Example - Spatial Join Transformer

In this example, we can use a spatial join to connect two datasets based on their location. Specifically, we can use the point position of items in our data to connect with the area polygons in a spatial dataset, and get the neighborhood label from that spatial dataset. To make this work, we'll need a few parts:
<ul>
<li> A geopandas dataframe with the spatial data. This is like a setting or configuration step, so it will be in the constructor. </li>
<li> A transformer that can take the 'regular' data we're using, perform the spatial join, and add the result in a new column. </li>
</ul>

As long as our input and outputs match the sklearn transformer format, we can use this in a pipeline just like any other transformer.
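
Before wrapping it in a transformer, here is a toy sketch of what a spatial join does on its own, with made-up squares standing in for neighbourhoods and made-up points standing in for our data. It assumes a GeoPandas version that accepts the `predicate` argument to `sjoin` (0.10 or newer).

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two made-up square "neighbourhoods" side by side
areas = gpd.GeoDataFrame(
    {"hood": ["west", "east"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
)

# Three made-up points; the last one sits outside both squares
places = gpd.GeoDataFrame(
    {"place": ["p1", "p2", "p3"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)],
)

# how="left" keeps every point; unmatched points get a missing label
joined = gpd.sjoin(places, areas, how="left", predicate="within")
labels = joined.set_index("place")["hood"]
print(labels)
```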

In [None]:
class Spatial_Joiner( BaseEstimator, TransformerMixin):
    # Still must complete. 
    def __init__(self, other):
        self.other = other
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        # Newer GeoPandas versions use `predicate`; the older `op` argument was removed
        X = gpd.sjoin(X, self.other, how="left", predicate="contains")
        return X

##### Prepare Data

I'll prepare the data here. The main thing is to ensure that geometry has properly captured the spatial data from the file. In this example, each column requires a little processing to get it into the right format.

In [None]:
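# Read the licence CSV, then build point geometry from the lon/lat columns
booze = gpd.read_file('../data/Alcohol_Sales_Licences.csv', crs=6933)
booze['geometry'] = gpd.points_from_xy(booze['Longitude'], booze['Latitude'])
# Note: crs= here only labels the coordinates, it does not reproject them
booze.set_geometry('geometry', inplace=True, crs=6933)

# The neighbourhood shapes are stored as WKT text in the "geom" column
hoods = gpd.read_file('../data/Neighbourhood_Boundaries.csv')
hoods["geometry"] = gpd.GeoSeries.from_wkt(hoods["geom"])
hoods.set_geometry("geometry", inplace=True, crs=6933)
hoods.drop("geom", axis=1, inplace=True)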

In [None]:
joinPipe = pipeline.Pipeline([
    ("joiner", Spatial_Joiner(other=booze))
])
tran_join_result = joinPipe.fit_transform(hoods)
tran_join_result.sample(10)

## Geospatial Transformations

With geospatial data, particularly for the things you're likely to be doing, these transformers and pipelines can be used to make tools to process the data automatically into a format that provides what you need for analysis. We can create a pipeline that takes in raw data and outputs it in some format that we know we want - the steps to do those transformations are the transformers in the pipeline. Once built, we can process any new data with no additional effort by just running it through the pipeline - this means that we'd never do something like manually manipulate data in Excel, we'd always use the pipeline to do it for us.

For your applications, you'll get some data and the format you need it in might vary, or you likely might need more than one format. For example, if you were displaying some data in Tableau or Power BI, you can take some raw data, run it through a pipeline that calculates whatever values need to be displayed, then output that as a datasource for your visualization. The transformer step might do all kinds of stuff like calculate distance, do spatial joins to get region labels, calculate area, etc... This data can then be fed to the visualization tool, and it can make pretty pictures without having to manipulate data there. There can be multiple pipelines (or outputs) that each prepares one central source of data, automatically, for different purposes.

If this sounds similar to some ETL stuff you talked about in the database classes, that's because it is. 

In [None]:
class AddAreaCalculation( BaseEstimator, TransformerMixin):
    def __init__(self, column_name):
        self.column_name = column_name
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        X_transformed = X.copy()
        X_transformed[self.column_name] = X_transformed["geometry"].area
        return X_transformed

## Exercise - Do That With Transformers

The processing steps above can be replaced with a pipeline containing custom steps in transformers. For the booze, we want to create a point from the lat/lon columns and set that answer into the geometry column. For the area, we want to add an area column.

<b>Hint:</b> we can chain transformers together almost any way we want, they can be connected like pipes in almost any configuration. The easy way to do this might not be the most intuitive... 

![Pipes](../images/pipe.jpg "Pipes")

In [None]:
# Make transformers


In [None]:
booze2 = gpd.read_file('../data/Alcohol_Sales_Licences.csv')
hoods2 = gpd.read_file('../data/Neighbourhood_Boundaries.csv')

#Make Pipe

## Transformative Thoughts

As we make these transformers, there are a few things that we might want to consider more than we would in other scenarios:
<ul>
<li> Speed - these transformations may be run on all data, so efficiency might matter. </li>
    <ul>
    <li> We should avoid things that are obviously slow, like nested loops. </li>
    <li> Vectorization is a good goal - like np operations on arrays. </li>
    </ul>
<li> Safety - we should be careful to avoid side effects. </li>
    <ul>
    <li> We should avoid changing the input data in place as it introduces risk. Unlikely, but possible. </li>
    <li> If we make a transformer, we have 0 idea where it'll be used, so safety first. </li>
    <li> It isn't relevant right now, but it isn't uncommon for us to process some data and send it to a model, then use that data for something else like visualizations or another model. We don't want accidental, invisible changes to the data behind our backs. In most cases, this won't be a real issue, but it is a very hard problem to troubleshoot if it does happen. </li>
    </ul>
<li> Flexibility - we should make sure that the transformer can handle a variety of data. By default, pipelines can work with data in arrays and dataframes in a way that is more-or-less seamless. We want the same behaviour in most cases. This usually isn't that big of a stretch: most other functions that we might need will work with different data types already, so as long as we don't introduce some object-specific action, we should be ok. </li>
</ul>
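
To illustrate the speed point, here is a sketch comparing a row-by-row loop with a vectorized NumPy version of the same calculation. The haversine great-circle formula is used purely as an example of a distance calculation (nothing here prescribes it as the method to use), the Earth radius is a common approximation, and the sample coordinates other than the first are made up.

```python
import math
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, a common approximation

def haversine_loop(lats, lons, lat0, lon0):
    """Slow version: one Python-level calculation per row."""
    out = []
    for lat, lon in zip(lats, lons):
        dlat = math.radians(lat - lat0)
        dlon = math.radians(lon - lon0)
        a = (math.sin(dlat / 2) ** 2
             + math.cos(math.radians(lat0)) * math.cos(math.radians(lat))
             * math.sin(dlon / 2) ** 2)
        out.append(2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a)))
    return out

def haversine_vectorized(lats, lons, lat0, lon0):
    """Same formula applied to whole arrays at once."""
    lats = np.radians(np.asarray(lats, dtype=float))
    lons = np.radians(np.asarray(lons, dtype=float))
    lat0, lon0 = np.radians(lat0), np.radians(lon0)
    a = (np.sin((lats - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lats) * np.sin((lons - lon0) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

lats = [53.57030, 53.54, 53.60]
lons = [-113.50087, -113.49, -113.55]
slow = haversine_loop(lats, lons, 53.57030, -113.50087)
fast = haversine_vectorized(lats, lons, 53.57030, -113.50087)
print(np.allclose(slow, fast))
```

Both versions give the same distances; the vectorized one pushes the loop down into NumPy, which tends to matter once the data has many thousands of rows.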

## Exercise - Make Some Transformers

For this exercise, create a transformer that adds a new column representing the distance between the two locations in the dataframe. The transformer should take the two columns containing the latitude and longitude of the two locations as input and add a new column with the distance between them.

Create a pipeline with transformers that:
<ul>
<li> Adds a new column with the distance to a set point. </li>
<li> Adds a new column with the distance to the nearest point in another dataset. </li>
<li> Adds a new column 'geometry' that is the point geometry, from lat and lon. </li>
<li> Add a transformer that filters things that are within a set distance of a point. </li>
</ul>

<b>Editor's Note:</b> I wrote these first, and they ended up being a bit redundant. I just didn't delete these prompts, but there's some overlap with the stuff above. 

In [None]:
# you know what to do.

In [None]:
ex_play = pd.read_csv("../data/playgrounds.csv")
ex_play.drop(columns=["Location", "Geometry Point"], inplace=True)
ex_play.head()

In [None]:
# Use it


In [None]:
# Set point test
nait_lat = 53.57030
nait_long = -113.50087
