# Comparing Sampling Algorithms with Fast Food

Many different sampling approaches are made available with `astartes`, both extrapolative and interpolative.
Some of the implemented algorithms build on one another (SPXY is an extension of Kennard-Stone), while others are entirely unique and quite complex (Sphere Exclusion comes to mind).
But what do they actually look like when it comes to splitting data into groups? And how do the different sampling approaches affect model performance?

For this notebook, we will use a very tangible dataset, the Burger King menu, subject it to the various sampling algorithms in `astartes`, and then plot the resulting splits.
The features for the dataset are the grams of protein, fat, and carbohydrates present in each menu item at the restaurant and the target is the number of calories in that item.
We know _a priori_ that there is an underlying relationship in this data: each macronutrient contributes a roughly fixed number of calories per gram, so a weighted sum of the grams of each macronutrient gives a close estimate of the calories.
Our goal is for the model to learn this simple relationship.
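
As a rough illustration of that weighted sum, the sketch below assumes the commonly cited factors of about 9 kcal per gram of fat and 4 kcal per gram of carbohydrate or protein. These factors are not part of the dataset, and `KCAL_PER_GRAM` and `estimate_calories` are hypothetical names used only for this example. The three items are rows taken from the menu table shown in Part 1.

```python
# Approximate energy density of each macronutrient in kcal per gram.
# These are general-purpose factors (an assumption, not from the dataset).
KCAL_PER_GRAM = {"fat": 9, "carbohydrates": 4, "protein": 4}


def estimate_calories(fat, carbohydrates, protein):
    """Return the weighted sum of macronutrient grams using the factors above."""
    return (
        KCAL_PER_GRAM["fat"] * fat
        + KCAL_PER_GRAM["carbohydrates"] * carbohydrates
        + KCAL_PER_GRAM["protein"] * protein
    )


# (fat, carbohydrates, protein, listed calories) for a few menu items
items = [
    (40.0, 49.0, 28.0, 660.0),
    (46.0, 50.0, 32.0, 740.0),
    (0.0, 11.0, 0.0, 40.0),
]

for fat, carbs, protein, listed in items:
    estimate = estimate_calories(fat, carbs, protein)
    print(f"listed={listed:.0f}  estimated={estimate:.0f}  difference={estimate - listed:+.0f}")
```

For these items the estimates fall within a few calories of the listed values, which is about as close as approximate per-gram factors can be expected to get.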

The dataset used in this notebook was retrieved from [this link](https://www.kaggle.com/datasets/mattop/burger-king-menu-nutrition-data?resource=download) and is in the public domain (see the [CC0 license](https://creativecommons.org/share-your-work/public-domain/cc0/)).

## Part 1. Perusing the Menu
Let's start by loading the dataset and taking a first look at its contents. To get the data into this notebook we will use `pandas`, one of the ubiquitous Python packages for machine learning. You can read more about `pandas` in its [documentation](https://pandas.pydata.org/docs/).

In [14]:
import pandas as pd

# Load the menu into a DataFrame and summarize its numeric columns
menu = pd.read_csv('burger-king-menu.csv')
menu.describe()

| | Calories | Fat Calories | Fat (g) | Saturated Fat (g) | Trans Fat (g) | Cholesterol (mg) | Sodium (mg) | Total Carb (g) | Dietary Fiber (g) | Sugars (g) | Protein (g) | Weight Watchers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 | 77.0 |
| mean | 501.428571 | 278.311688 | 30.967532 | 9.805195 | 0.636364 | 101.753247 | 993.246753 | 35.181818 | 1.779221 | 6.636364 | 20.909091 | 497.064935 |
| std | 307.612685 | 184.393762 | 20.535966 | 8.118431 | 1.128682 | 97.958659 | 613.426403 | 20.716588 | 1.690713 | 6.973463 | 17.145033 | 302.23807 |
| min | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 12.0 |
| 25% | 260.0 | 140.0 | 16.0 | 3.5 | 0.0 | 25.0 | 470.0 | 26.0 | 1.0 | 1.0 | 12.0 | 252.0 |
| 50% | 430.0 | 250.0 | 28.0 | 8.0 | 0.0 | 70.0 | 1010.0 | 30.0 | 1.0 | 6.0 | 17.0 | 416.0 |
| 75% | 700.0 | 380.0 | 42.0 | 14.0 | 0.5 | 175.0 | 1420.0 | 49.0 | 2.0 | 10.0 | 28.0 | 690.0 |
| max | 1220.0 | 750.0 | 84.0 | 33.0 | 4.5 | 390.0 | 2840.0 | 110.0 | 9.0 | 40.0 | 71.0 | 1192.0 |


As you can see from the output of `menu.describe()`, this dataset has 77 items and more columns than we need, so let's drop the ones we will not use. We will also rename some of the remaining columns to make them easier to work with, and we will keep the `Category` labels for later comparison with the splits that `astartes` creates.

In [15]:
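# Keep only the category label, the calories, and the three macronutrients
menu = menu.drop(columns=['Item','Fat Calories','Saturated Fat (g)','Trans Fat (g)','Cholesterol (mg)','Sodium (mg)','Dietary Fiber (g)','Sugars (g)','Weight Watchers'])
# rename() returns a new DataFrame, so assign the result back to keep the new names
menu = menu.rename(columns={'Total Carb (g)': 'Carbohydrates','Fat (g)': 'Fat','Protein (g)':'Protein'})
menu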

| | Category | Calories | Fat | Carbohydrates | Protein |
|---|---|---|---|---|---|
| 0 | Burgers | 660.0 | 40.0 | 49.0 | 28.0 |
| 1 | Burgers | 740.0 | 46.0 | 50.0 | 32.0 |
| 2 | Burgers | 790.0 | 51.0 | 50.0 | 35.0 |
| 3 | Burgers | 900.0 | 58.0 | 49.0 | 48.0 |
| 4 | Burgers | 980.0 | 64.0 | 50.0 | 52.0 |
| ... | ... | ... | ... | ... | ... |
| 72 | Breakfast | 40.0 | 0.0 | 11.0 | 0.0 |
| 73 | Breakfast | 140.0 | 15.0 | 1.0 | 1.0 |
| 74 | Breakfast | 80.0 | 8.0 | 2.0 | 0.0 |
| 75 | Breakfast | 150.0 | 15.0 | 3.0 | 0.0 |
