# Examples of tidy data and tidy tools

For the Aalto University Machine Learning D course guest lecture, 31.01.2022

Read more on tidy data [here](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)

Read more on tidy tools [here](https://cran.r-project.org/web/packages/tidyverse/vignettes/manifesto.html)

In [None]:
!pip install pandas
!pip install scikit-learn

Let's create an example of a messy dataset (as an exercise, you can try cleaning a real dataset instead):

In [None]:
messy_data = [[{"a": 1, "b": 3}, 2], [{"a": 4, "b": 8, "age": "young"}, 5]]
messy_data

In [None]:
import numpy as np  # used in later cells (np.nan, np.ndarray, np.abs)
import pandas as pd
messy_df = pd.DataFrame(messy_data, columns=("features", "label"))
messy_df

Let's tidy the dataset. Remember that each messy dataset is messy in its own way, so the tidying process depends on the data. This is just a minimal example:

In [None]:
tidy_df = pd.concat(
    [messy_df["features"].apply(pd.Series), messy_df[["label"]]], axis=1
)
tidy_df
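
The same flattening could also be done with `pd.json_normalize`, which expands a list of dictionaries into columns. This is only a sketch of one alternative, not the method used in the lecture; the name `tidy_alt` is made up for illustration, and the toy data is repeated so the cell runs on its own.

```python
import pandas as pd

# same toy data as above, repeated so this cell is self-contained
messy_data = [[{"a": 1, "b": 3}, 2], [{"a": 4, "b": 8, "age": "young"}, 5]]
messy_df = pd.DataFrame(messy_data, columns=("features", "label"))

# expand each feature dict into its own columns; keys missing from a row become NaN
features = pd.json_normalize(messy_df["features"].tolist())

# put the label back next to the features (both frames share the default 0..n-1 index)
tidy_alt = features.join(messy_df["label"])
tidy_alt
```

Either way the result has one column per variable and one row per observation; which approach reads better probably depends on how deeply the real data is nested.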

Here is an example of a messy model class (even messier would be a plain script with no reusable components such as classes and functions):

In [None]:
from sklearn.linear_model import LinearRegression as reg


class MessyModel:
    """
    Poorly constructed ML model class
    """

    def __init__(self, data):
        # model accepts data in custom (messy) format
        # and it has to be cleaned before use
        self.data = pd.concat(
            [data["features"].apply(pd.Series), data[["label"]]], axis=1
        )
        self.data = pd.concat(
            [self.data[["a", "b"]].apply(pd.Series), self.data[["label"]]], axis=1
        )
        # in addition model instance creates an unnecessary copy of the data
        self.score = np.nan

    def y_val(self, X):  # unconventional naming, non-verb, difficult to understand
        # a function first does a side-effect
        self.model = reg()
        self.model.fit(self.data.iloc[:, :-1], self.data.iloc[:, -1])
        # and then a transformation
        return self.model.predict(X)

    def score_calculations(self):
        # again, mixing transformations, side-effects and poor naming
        self.score = self.model.score(self.data.iloc[:, :-1], self.data.iloc[:, -1])
        return self


# and we cannot pipe the functions!
messymodel = MessyModel(messy_df)
# test_x was never defined in the original cell; these values are purely illustrative
test_x = pd.DataFrame({"a": [2], "b": [5]})
print(f"fit model and predict new values, I guess? {messymodel.y_val(test_x)}")
messymodel.score_calculations()
print(f"score, but what score, and of what? {messymodel.score}")

Now, let's reconstruct the class the tidy way!

In [None]:
train_df = tidy_df.drop("age", axis=1)
X_train = train_df.iloc[:, :-1]
y_train = train_df.iloc[:, -1]


class TidyModel:
    """
    Fit, predict and evaluate a linear regression model
    """

    def __init__(self):
        # model instance does not contain a copy of the data
        self.model = reg()

    def fit(self, X, y):
        self.model.fit(X, y)  # the model only accepts tidy data
        # This function performs a side-effect only (model fit), so it returns self
        return self

    def predict(self, X) -> np.ndarray:
        # and here we do a transformation X -> y, so we return values!
        return self.model.predict(X)  # start noticing something?

    def score(self, X, y):
        # yup, scikit-learn and other standard tools follow tidy principles!
        return self.model.score(X, y)

# and so we can pipe simple functions!
tidymodel = TidyModel().fit(X_train, y_train)  # see pipe!
tidymodel.score(X_train, y_train)  # pipe!
TidyModel().fit(X_train, y_train).score(X_train, y_train)  # pipe again!

# can we go any further? Of course!
# mean absolute error (MAE) on the training data, computed in a single pipe
MAE = np.abs(TidyModel().fit(X_train, y_train).predict(X_train) - y_train).mean()
# wow, we can build even longer pipes that are still readable, meaningful and convenient!
MAE
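
Because scikit-learn estimators follow the same fit/predict convention, the standard tools can be chained in much the same way. The sketch below adds a scaling step in front of the regression purely for illustration (the lecture does not use one) and recomputes the MAE with `sklearn.metrics.mean_absolute_error`. The names `pipeline`, `predictions` and `mae` are hypothetical, and the toy data is repeated so the cell runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# the tidy toy data from above, without the "age" column
X_train = pd.DataFrame({"a": [1, 4], "b": [3, 8]})
y_train = pd.Series([2, 5], name="label")

# fit returns the fitted pipeline, so predict can be chained directly
pipeline = make_pipeline(StandardScaler(), LinearRegression())
predictions = pipeline.fit(X_train, y_train).predict(X_train)

# mean absolute error on the training data
mae = mean_absolute_error(y_train, predictions)
mae
```

With only two training rows the model fits the data exactly, so the training MAE is zero up to floating-point error; it says nothing about how the model would do on new data.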