# Exercise

Your job is to create a class called `Preprocessor` that can:

- read a file from its filename
- store the raw data as an attribute
- encode categorical data with either a one-hot encoder or a label encoder
- scale the data with either a min-max scaler or a standard scaler
- plot the data on either a new axis or a previously supplied one
- return the distance between any two points in the processed data
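
The "new axis or previously supplied one" requirement is commonly handled with an optional `ax` argument. The following is only a minimal sketch of that pattern: the helper name `scatter_on` and the data are made up for illustration and are not part of the exercise template.

```python
import matplotlib.pyplot as plt
import pandas as pd


def scatter_on(df: pd.DataFrame, x: str, y: str, ax=None):
    """Scatter df[x] against df[y] on `ax`, creating a new axis if none is given.

    Hypothetical helper for illustration; not part of the exercise template.
    """
    if ax is None:
        # No axis supplied: create a fresh figure and axis.
        _, ax = plt.subplots()
    ax.scatter(df[x], df[y])
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    return ax


# Made-up data for illustration only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 8.0]})

new_ax = scatter_on(df, "a", "b")  # draws on a newly created axis

fig, (left, right) = plt.subplots(1, 2)
same_ax = scatter_on(df, "a", "b", ax=right)  # draws on an existing axis
```

Returning the axis lets the caller keep customising the plot, whichever way it was created.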

## `sklearn.preprocessing`

Up to this point we have only used hand-written preprocessing functions. This is acceptable, and useful practice for beginner programmers, but it can be computationally inefficient. Hand-coding many different encoding algorithms is also rarely a good use of time, and it leaves room for preprocessing mistakes.

We therefore introduce a software package that will be used throughout the rest of the lectures: `scikit-learn`. It is a Python package for simple machine learning tasks, and it is designed to be highly optimised.

The `sklearn.preprocessing` module provides many different preprocessing functions. Its documentation can be found at [https://scikit-learn.org/stable/modules/preprocessing.html](https://scikit-learn.org/stable/modules/preprocessing.html).
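
As a rough illustration of what the module offers, the sketch below applies two scalers and two encoders to made-up data. The variable names and values are assumptions chosen for the example, not part of the exercise.

```python
import numpy as np
from sklearn.preprocessing import (
    LabelEncoder,
    MinMaxScaler,
    OneHotEncoder,
    StandardScaler,
)

# Made-up data for illustration only.
values = np.array([[1.0], [2.0], [3.0], [4.0]])
labels = np.array(["red", "green", "blue", "green"])

# Scalers expect a 2D array of shape (n_samples, n_features).
minmax = MinMaxScaler().fit_transform(values)      # rescales each column to [0, 1]
standard = StandardScaler().fit_transform(values)  # zero mean, unit variance

# LabelEncoder maps each category to an integer, using sorted category order.
label_codes = LabelEncoder().fit_transform(labels)

# OneHotEncoder returns a sparse matrix by default, so convert it to dense.
one_hot = OneHotEncoder().fit_transform(labels.reshape(-1, 1)).toarray()

print(minmax.ravel())
print(label_codes)
print(one_hot.shape)
```

Note that `LabelEncoder` is documented as a tool for target labels and works on 1D input, whereas `OneHotEncoder` expects a 2D array of features.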

<b>The answers to this exercise will contain some examples of `sklearn.preprocessing` functions</b>, but it is important to remember that coding simple functions like scalers and encoders by hand has merit, because it gives the user total control over the algorithms implemented. There is no single <i>right way</i> of preprocessing data, and hence we encourage users to use `sklearn.preprocessing` functions only when they fully understand the algorithms that have been implemented.
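
One way to check that understanding is to reproduce a library function by hand and compare the results. The sketch below does this for the two scalers. The helper names `manual_minmax` and `manual_standard` and the data are made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def manual_minmax(x: np.ndarray) -> np.ndarray:
    """Rescale each column to [0, 1]: (x - min) / (max - min)."""
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))


def manual_standard(x: np.ndarray) -> np.ndarray:
    """Centre each column and divide by its population standard deviation."""
    return (x - x.mean(axis=0)) / x.std(axis=0)


# Made-up data for illustration only: two columns on different scales.
x = np.array([[1.0, 100.0], [2.0, 300.0], [4.0, 200.0]])

minmax_matches = np.allclose(manual_minmax(x), MinMaxScaler().fit_transform(x))
standard_matches = np.allclose(manual_standard(x), StandardScaler().fit_transform(x))
print(minmax_matches, standard_matches)
```

The comparison also exposes a detail that is easy to miss: `StandardScaler` divides by the population standard deviation (`ddof=0`), which is the NumPy default, whereas `pandas.DataFrame.std` defaults to the sample standard deviation (`ddof=1`).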

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

class Preprocessor:
    def __init__(self, fname: str, **kwargs):
        self.raw = pd.read_csv(fname, **kwargs)
        self.data = self.raw.copy()

    def encode(self, kind: str):
        # update self.data with the appropriate encoder
        # >>>

        # <<<
        return

    def scale(self, kind: str):
        # update self.data with the appropriate scaler
        # >>>

        # <<<
        return

    def plot(self, x: str, y: str, c: str):
        # create axes and plot self.data onto it
        # >>>

        # <<<
        return

    def distance(self, index_1: int, index_2: int) -> float:
        # use previous notes and scipy to get euclidean distance
        # between two points in self.data, by their index
        # >>>

        # <<<
        return
```

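For the distance between two rows, one possible route through SciPy is `scipy.spatial.distance.euclidean`. The sketch below uses a small made-up `DataFrame` rather than `self.data`, and assumes every column is already numeric.

```python
import pandas as pd
from scipy.spatial.distance import euclidean

# Made-up, already-numeric data for illustration only.
df = pd.DataFrame({"x": [0.0, 3.0, 6.0], "y": [0.0, 4.0, 8.0]})

# .iloc selects rows by position; euclidean expects two 1D vectors.
point_1 = df.iloc[0].to_numpy()
point_2 = df.iloc[1].to_numpy()

d = euclidean(point_1, point_2)
print(d)  # 5.0, the hypotenuse of a 3-4-5 triangle
```
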
## Part II

Now that you have created the class, it is time to use it to demonstrate why we do scaling and encoding at all. Use the `Preprocessor` class to read each of the following files, and compare the distances between the data points in the form of a histogram.

Here are the files and their sources:

- `'data/300-data-1.csv'` [https://www.kaggle.com/camnugent/california-housing-prices](https://www.kaggle.com/camnugent/california-housing-prices)
- `'data/300-data-2.csv'` [https://www.kaggle.com/vinesmsuic/star-categorization-giants-and-dwarfs](https://www.kaggle.com/vinesmsuic/star-categorization-giants-and-dwarfs)
- `'data/300-data-3.csv'` [https://www.kaggle.com/prudhvignv/milk-grading](https://www.kaggle.com/prudhvignv/milk-grading)
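
To give an idea of the kind of comparison intended, the sketch below plots histograms of all pairwise distances before and after scaling. It uses synthetic data rather than the files above, and it uses `scipy.spatial.distance.pdist` and `StandardScaler` directly instead of the `Preprocessor` class. All of these choices are assumptions made for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: two features on very different scales.
rng = np.random.default_rng(0)
data = np.column_stack([
    rng.normal(0.0, 1.0, size=200),     # small-scale feature
    rng.normal(0.0, 1000.0, size=200),  # large-scale feature
])
scaled = StandardScaler().fit_transform(data)

# pdist returns the condensed vector of all n * (n - 1) / 2 pairwise distances.
raw_distances = pdist(data, metric="euclidean")
scaled_distances = pdist(scaled, metric="euclidean")

fig, (ax_raw, ax_scaled) = plt.subplots(1, 2, figsize=(10, 4))
ax_raw.hist(raw_distances, bins=50)
ax_raw.set_title("Raw data")
ax_scaled.hist(scaled_distances, bins=50)
ax_scaled.set_title("Standard-scaled data")
for ax in (ax_raw, ax_scaled):
    ax.set_xlabel("Euclidean distance")
    ax.set_ylabel("Count")
```

In the raw data the distances are dominated by the large-scale feature, so the small-scale feature contributes almost nothing. After scaling, both features contribute on a comparable footing.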
