# UBC Canadian Cheese Analysis

## Introduction

Canada has produced cheese since the early 1600s. French settlers brought cattle from Normandy and over time the Norman cows were bred to produce the Canadienne breed[1].

Canadian cheese comes in many forms and flavours. Milk from cows, goats, and sheep is used to make some very tasty products.

Fat content in cheese plays a crucial role in determining its flavor, texture, nutritional value, and culinary applications. Different levels of fat content can cater to various preferences and dietary needs. 

Fat content matters to people with cardiovascular concerns, for whom the saturated fat in cheese may be a consideration. Lower-fat cheese options are available for those seeking to reduce their fat intake. Additionally, the fat content of cheese can be related to its lactose content. Harder, aged cheeses with higher fat content often have lower lactose levels, making them more suitable for people with lactose intolerance.

Culinary uses of cheese range from a simple slice of provolone on a sandwich to cacio e pepe, where it is melted as part of a sauce. Fat affects how cheese melts: higher-fat cheeses tend to melt more smoothly and evenly, making them ideal for sauces, fondues, and toppings.

In this analysis we will be looking into fat content. What factors determine the fat content of a cheese? Can we predict the fat content? Knowing these factors can help consumers and commercial enterprises choose the cheese they want.

We will use machine learning regression techniques to train a model, with the best parameters, and produce a tool that predicts fat content in cheese from data such as the animal that produced the milk, the province where the cheese was produced, and whether it was produced organically.

## Data

We will be using data from the Canadian Cheese Directory to perform our analysis that will lead to our predictive model.

First we need to make sure the data is in a state that we can use. For example, fields that are blank will need to be filled in. Let's build a pipeline to prepare our data.

Import the required Python libraries.

In [1]:
import altair as alt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate, cross_val_score, KFold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    scale,
    LabelEncoder)
from sklearn.metrics import classification_report, mean_absolute_error, mean_squared_error, r2_score
from sklearn.svm import  SVR
from sklearn.linear_model import Ridge, LogisticRegression, LinearRegression
from sklearn.feature_extraction.text import CountVectorizer



Load the data into a dataframe.

In [2]:
cheese_df = pd.read_csv("data/cheese_data.csv")
cheese_df.head(10)

Unnamed: 0,CheeseId,ManufacturerProvCode,ManufacturingTypeEn,MoisturePercent,FlavourEn,CharacteristicsEn,Organic,CategoryTypeEn,MilkTypeEn,MilkTreatmentTypeEn,RindTypeEn,CheeseName,FatLevel
0,228,NB,Farmstead,47.0,"Sharp, lactic",Uncooked,0,Firm Cheese,Ewe,Raw Milk,Washed Rind,Sieur de Duplessis (Le),lower fat
1,242,NB,Farmstead,47.9,"Sharp, lactic, lightly caramelized",Uncooked,0,Semi-soft Cheese,Cow,Raw Milk,Washed Rind,Tomme Le Champ Doré,lower fat
2,301,ON,Industrial,54.0,"Mild, tangy, and fruity","Pressed and cooked cheese, pasta filata, inter...",0,Firm Cheese,Cow,Pasteurized,,Provolone Sette Fette (Tre-Stelle),lower fat
3,303,NB,Farmstead,47.0,Sharp with fruity notes and a hint of wild honey,,0,Veined Cheeses,Cow,Raw Milk,,Geai Bleu (Le),lower fat
4,319,NB,Farmstead,49.4,Softer taste,,1,Semi-soft Cheese,Cow,Raw Milk,Washed Rind,Gamin (Le),lower fat
5,350,NB,Industrial,48.0,,Classic fresh cooking cheeses,0,Fresh Cheese,Cow,Pasteurized,,Paneer (Northumberland Co-operative),lower fat
6,374,ON,Industrial,52.0,"Rich, creamy, buttery, both subtle and tangy i...",,0,Soft Cheese,Goat,Pasteurized,Bloomy Rind,Goat Brie (Woolwich),lower fat
7,375,ON,Industrial,41.0,Mild,"Whitem, smooth, firm textured",0,Firm Cheese,Goat,Pasteurized,,Goat Cheddar (Woolwich),lower fat
8,376,ON,Industrial,50.0,Mild,,0,Semi-soft Cheese,Goat,Pasteurized,,Goat Mozarella (Woolwich),lower fat
9,378,ON,Industrial,55.0,"Sharp, tangy, salty",With or without brine,0,Soft Cheese,Goat,Pasteurized,,Goat Feta (Woolwich),lower fat


In [3]:
# Check unique values and their counts in FlavourEn
print(cheese_df['FlavourEn'].value_counts(dropna=False))

# Check unique values and their counts in RindTypeEn
print(cheese_df['RindTypeEn'].value_counts(dropna=False))

# Check unique values and their counts in CharacteristicsEn
print(cheese_df['CharacteristicsEn'].value_counts(dropna=False))

NaN                                                                              241
Mild                                                                              59
Sharp                                                                             13
Hazelnut flavor that intensifies with age                                         10
Hazelnut flavour that intensifies with age                                         9
                                                                                ... 
Creamy and rich tasting.  Creamy in colour due to the carotene in Jersey milk      1
Strong notes of butter and caramel                                                 1
Slight hazelnut taste                                                              1
Mild, almond taste                                                                 1
Fruity, mushrooms and hazelnut flavor                                              1
Name: FlavourEn, Length: 636, dtype: int64
No Rind         404
Na

To follow the Golden Rule, under which test data must not influence the model during training, we need to split the data into training and test dataframes. Splitting ensures that the model is trained on one subset of the data and evaluated on another, which helps assess its performance on unseen data and mimics real-world use. Setting the random_state parameter makes the split identical across runs, so the results are reproducible, which is important for debugging, comparing different models, and validating results.
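As a small illustration of the reproducibility point, the sketch below splits a made-up ten-row frame twice with the same `random_state` and checks that both splits select the same rows. The frame `toy_df` and its values are hypothetical stand-ins, not the cheese data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the cheese data
toy_df = pd.DataFrame({
    "MoisturePercent": [40.0, 42.0, 44.0, 46.0, 48.0, 50.0, 52.0, 54.0, 56.0, 58.0],
    "FatLevel": ["lower fat"] * 6 + ["higher fat"] * 4,
})

# The same random_state selects the same rows on every run
train_a, test_a = train_test_split(toy_df, test_size=0.3, random_state=123)
train_b, test_b = train_test_split(toy_df, test_size=0.3, random_state=123)

print(train_a.index.equals(train_b.index))  # True: identical training rows
print(len(train_a), len(test_a))            # 7 training rows, 3 test rows

# No row appears in both subsets
print(set(train_a.index) & set(test_a.index))
```

Changing `random_state` (or omitting it) would generally produce a different partition, which is why a fixed value is helpful when comparing models.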

In [4]:
train_df, test_df = train_test_split(cheese_df, test_size=0.3, random_state=123)

# Plot with Altair
alt.Chart(cheese_df).mark_circle().encode(
    x='MoisturePercent:Q',
    y='FatLevel:N',
    color=alt.Color('FatLevel', scale=alt.Scale(scheme='viridis')),
    tooltip=['MoisturePercent', 'FatLevel']
).properties(
    width=600,
    height=400,
    title='Figure 1: Moisture Percent vs Fat Level'
).interactive()



In [5]:
# Plot with Altair
alt.Chart(cheese_df).mark_rect().encode(
    x='MilkTypeEn:N',
    y='FatLevel:N',
    color=alt.Color('MilkTypeEn:N', legend=None),
    tooltip=['MilkTypeEn', 'FatLevel']
).properties(
    width=600,
    height=400,
    title='Figure 2: Average Fat Level by Milk Type'
).interactive()


In [6]:
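# Convert FatLevel to numerical values (lower fat = 0, higher fat = 1).
# An explicit mapping keeps the direction intuitive: LabelEncoder sorts labels
# alphabetically and would assign 'higher fat' = 0 and 'lower fat' = 1.
cheese_df['FatLevel_num'] = cheese_df['FatLevel'].map({'lower fat': 0, 'higher fat': 1})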

# Plot with Altair
alt.Chart(cheese_df).mark_bar().encode(
    x='CategoryTypeEn:N',
    y='mean(FatLevel_num):Q',
    color=alt.Color('CategoryTypeEn:N', legend=None),
    tooltip=['CategoryTypeEn', 'mean(FatLevel_num)']
).properties(
    width=600,
    height=400,
    title='Figure 3: Average Fat Level by Cheese Category'
).interactive()



In [7]:
# Plot with Altair
alt.Chart(cheese_df).mark_bar().encode(
    x='Organic:N',
    y='mean(FatLevel_num):Q',
    color=alt.Color('Organic:N', legend=None),
    tooltip=['Organic', 'mean(FatLevel_num)']
).properties(
    width=600,
    height=400,
    title='Figure 4: Average Fat Level by Organic Status'
).interactive()




Looking at the dataframe, we can see rows with null values in some features. The flavour, characteristics, and rind type features have significant gaps. Some interesting features, such as moisture, cheese category, milk type, milk treatment, and whether the cheese was produced organically, may be useful in predicting fat levels.

Grouping the features by type will be helpful. Some of the features are binary, some are numeric, and others are categorical.

Data preprocessing will be required.
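To make the preprocessing idea concrete, here is a minimal sketch on a made-up four-row frame: a numeric gap is filled with the column mean, and a categorical gap is filled with the most frequent value before one-hot encoding. The frame `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frame with one numeric gap and one categorical gap
toy = pd.DataFrame({
    "MoisturePercent": [40.0, np.nan, 50.0, 60.0],
    "MilkTypeEn": ["Cow", "Cow", "Goat", np.nan],
})

# Numeric gap -> mean of the observed values (40, 50, 60 -> 50)
num_filled = SimpleImputer(strategy="mean").fit_transform(toy[["MoisturePercent"]])

# Categorical gap -> most frequent value ("Cow"), then one column per category
cat_filled = SimpleImputer(strategy="most_frequent").fit_transform(toy[["MilkTypeEn"]])
encoder = OneHotEncoder(handle_unknown="ignore")
cat_encoded = encoder.fit_transform(cat_filled).toarray()

print(num_filled.ravel())
print(cat_filled.ravel())
print(encoder.categories_[0], cat_encoded.shape)
```

Binary features such as an organic flag can be imputed the same way as categorical ones and passed through without encoding, since they are already 0/1.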

Some of the features contain free text that is subjective and varies in its formatting. These features may not be needed to predict fat levels.

Fat level would be our target feature, as that is what we are trying to predict.

Mean Absolute Error (MAE) calculates the average absolute difference between the predicted and actual fat levels. MAE is easy to interpret and provides a straightforward measure of the model's prediction error in the same units as the target variable.

Mean Squared Error (MSE) calculates the average squared difference between the predicted and actual fat levels. It penalizes larger errors more heavily than MAE and is useful for understanding the spread of errors in the predictions.
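The two definitions above can be checked by hand on a few made-up predictions. The arrays below are illustrative values, not results from the cheese model; the sketch computes both metrics directly and compares them with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets (0 = lower fat, 1 = higher fat) and predictions
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.1, 0.6, 0.9, 0.5])

errors = y_true - y_pred                 # [-0.1, 0.4, 0.1, -0.5]

# MAE: mean of absolute errors = (0.1 + 0.4 + 0.1 + 0.5) / 4 = 0.275
mae_by_hand = np.mean(np.abs(errors))

# MSE: mean of squared errors = (0.01 + 0.16 + 0.01 + 0.25) / 4 = 0.1075
mse_by_hand = np.mean(errors ** 2)

print(mae_by_hand, mean_absolute_error(y_true, y_pred))
print(mse_by_hand, mean_squared_error(y_true, y_pred))
```

The largest error (0.5) contributes 0.25 of the 0.43 total squared error but only 0.5 of the 1.1 total absolute error, which is the sense in which MSE penalizes large errors more heavily.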

### The cheese dataframe has missing values. Some of the features are numerical, some are binary, and some are categorical. We will need to build a pipeline to preprocess this dataframe so it is useful to us.

In [8]:
# Load the data
cheese_df = pd.read_csv('data/cheese_data.csv')

# Transform FatLevel to numerical (arbitrary conversion)
fat_level_mapping = {'lower fat': 0, 'higher fat': 1}
cheese_df['FatLevelNum'] = cheese_df['FatLevel'].map(fat_level_mapping)

# Specify features and target (the target itself is kept out of the feature list)
features = ['MoisturePercent', 'ManufacturerProvCode', 'ManufacturingTypeEn', 'FlavourEn', 'CharacteristicsEn', 'CategoryTypeEn', 'MilkTypeEn', 'MilkTreatmentTypeEn', 'RindTypeEn', 'CheeseName', 'Organic']
target = 'FatLevelNum'

# Remove any features that are not present in the dataframe
features = [feat for feat in features if feat in cheese_df.columns]

# Place features in categories
numeric_features = ['MoisturePercent']
categorical_features = ['ManufacturerProvCode', 'ManufacturingTypeEn', 'CategoryTypeEn', 'MilkTypeEn', 'MilkTreatmentTypeEn', 'RindTypeEn']
binary_features = ['Organic']


# Create pipelines to preprocess the data
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with most frequent category
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

binary_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))  # Impute missing values with the most frequent category
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features),
        ('binary', binary_transformer, binary_features)
    ])

# Fit and transform the preprocessor on the entire dataframe
X_preprocessed = preprocessor.fit_transform(cheese_df)

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, cheese_df[target], test_size=0.3, random_state=123)


# 1. Check shape
print("Shape of X_train:", X_train.shape)

# 2. View the first few rows of data
print("First few rows of X_train:")
print(X_train[:5])  # Adjust the number of rows as needed

# 3. Check for missing values
missing_values = np.isnan(X_train.toarray()).sum()
print("Number of missing values in X_train:", missing_values)

num_columns = X_train.shape[1]
print("Number of columns in X_train:", num_columns)

num_columns_df = cheese_df.shape[1]
print("Number of columns in cheese_df:", num_columns_df)



Shape of X_train: (729, 36)
First few rows of X_train:
  (0, 0)	1.3577417965655938
  (0, 9)	1.0
  (0, 12)	1.0
  (0, 18)	1.0
  (0, 27)	1.0
  (0, 28)	1.0
  (0, 31)	1.0
  (1, 0)	-1.2673843420060433
  (1, 9)	1.0
  (1, 11)	1.0
  (1, 14)	1.0
  (1, 21)	1.0
  (1, 28)	1.0
  (1, 33)	1.0
  (2, 0)	0.307691341136939
  (2, 9)	1.0
  (2, 11)	1.0
  (2, 17)	1.0
  (2, 21)	1.0
  (2, 28)	1.0
  (2, 34)	1.0
  (3, 0)	-0.8473641598345814
  (3, 9)	1.0
  (3, 13)	1.0
  (3, 14)	1.0
  (3, 21)	1.0
  (3, 29)	1.0
  (3, 33)	1.0
  (4, 0)	-2.2124297518918326
  (4, 9)	1.0
  (4, 13)	1.0
  (4, 17)	1.0
  (4, 21)	1.0
  (4, 28)	1.0
  (4, 33)	1.0
Number of missing values in X_train: 0
Number of columns in X_train: 36
Number of columns in cheese_df: 14
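One point worth checking against the Golden Rule: when a preprocessor is fitted on the whole dataframe before splitting, the imputation means and scaling statistics are computed with the test rows included. A minimal sketch of the alternative ordering (split first, fit on the training rows only) is shown below on synthetic data; `toy_df`, `toy_preprocessor`, and the column subset are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cheese data
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "MoisturePercent": rng.uniform(30, 60, size=40),
    "MilkTypeEn": rng.choice(["Cow", "Goat", "Ewe"], size=40),
    "FatLevelNum": rng.integers(0, 2, size=40),
})
X = toy_df[["MoisturePercent", "MilkTypeEn"]]
y = toy_df["FatLevelNum"]

# 1. Split the raw data first
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=123)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                          ("scaler", StandardScaler())]), ["MoisturePercent"]),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                          ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["MilkTypeEn"]),
    ],
    sparse_threshold=0,  # always return a dense array, for easy inspection
)

# 2. Learn means, scales, and categories from the training rows only
X_tr_prep = toy_preprocessor.fit_transform(X_tr)
# 3. Apply those learned statistics to the test rows without refitting
X_te_prep = toy_preprocessor.transform(X_te)

print(X_tr_prep.shape, X_te_prep.shape)
print(X_tr_prep[:, 0].mean())  # close to 0: scaled with training statistics
```

Placing the preprocessor inside a `Pipeline` with the regressor, as the modelling cells do, gives the same guarantee automatically during cross-validation, because each fold refits the preprocessor on its own training portion.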


### We are now ready to create our model. We will use regression, along with our preprocessor in a pipeline

In [9]:
# Define the model
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', RandomForestRegressor(n_estimators=100, random_state=0))])

### Let's cross validate our model.
Cross-validation is a critical tool in machine learning for assessing model performance, detecting overfitting, tuning hyperparameters, and making informed decisions about which model to use. Let's use KFold[2] to cross-validate. Scoring the model on folds it was not trained on gives a more honest estimate of how it generalizes than a score on the training data.

In [10]:
# Perform cross-validation using the raw data and the pipeline
kf = KFold(n_splits=10, shuffle=True, random_state=1)
cv_scores = cross_val_score(model_pipeline, cheese_df[features], cheese_df[target], cv=kf, scoring='neg_mean_absolute_error')
cv_mae = -cv_scores.mean()

cv_mse_scores = cross_val_score(model_pipeline, cheese_df[features], cheese_df[target], cv=kf, scoring='neg_mean_squared_error')
cv_mse = -cv_mse_scores.mean()

# Print the cross-validated Mean Absolute Error and Mean Squared Error
print(f"Cross-validated MAE: {cv_mae}")
print(f"Cross-validated MSE: {cv_mse}")

Cross-validated MAE: 0.19418225729122482
Cross-validated MSE: 0.11233256146451176
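Both metrics can also be collected in a single pass with `cross_validate`, which accepts a list of scorers and fits each fold only once. The sketch below uses synthetic data and a small forest purely for illustration; `X_toy` and `y_toy` are assumptions, not the cheese features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic features and a synthetic 0/1 target
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=1)

# One cross-validation run, two metrics per fold
scores = cross_validate(
    RandomForestRegressor(n_estimators=20, random_state=0),
    X_toy, y_toy, cv=kf,
    scoring=["neg_mean_absolute_error", "neg_mean_squared_error"],
)

# scikit-learn reports negated errors so that higher is always better
cv_mae = -scores["test_neg_mean_absolute_error"].mean()
cv_mse = -scores["test_neg_mean_squared_error"].mean()

print(f"Cross-validated MAE: {cv_mae:.3f}")
print(f"Cross-validated MSE: {cv_mse:.3f}")
```

With a 0/1 target and predictions between 0 and 1, every squared error is no larger than the matching absolute error, so the MSE cannot exceed the MAE. This matches the pattern in the cross-validated scores above.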


### What are the important features?

In [11]:
# Fit the pipeline on the full dataset (cheese_df), the same data used for cross-validation
model_pipeline.fit(cheese_df[features], cheese_df[target])

# Extract the trained RandomForestRegressor from the pipeline
regressor = model_pipeline.named_steps['regressor']

# Get feature importances
feature_importances = regressor.feature_importances_

# Map the feature importances to feature names
# We need to account for the one-hot encoded categorical features
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
ohe_feature_names = model_pipeline.named_steps['preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names_out(categorical_features)
all_feature_names = numeric_features + list(ohe_feature_names) + binary_features

# Create a DataFrame for better visualization
feature_importances_df = pd.DataFrame({
    'Feature': all_feature_names,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

print("Feature Importances:")
print(feature_importances_df)

Feature Importances:
                            Feature  Importance
0                   MoisturePercent    0.603548
27                  MilkTypeEn_Goat    0.031011
11      ManufacturingTypeEn_Artisan    0.028824
31           RindTypeEn_Bloomy Rind    0.024466
13   ManufacturingTypeEn_Industrial    0.024155
9           ManufacturerProvCode_QC    0.023431
7           ManufacturerProvCode_ON    0.021721
18       CategoryTypeEn_Soft Cheese    0.020542
35                          Organic    0.020365
21                   MilkTypeEn_Cow    0.018782
2           ManufacturerProvCode_BC    0.017881
12    ManufacturingTypeEn_Farmstead    0.016395
28  MilkTreatmentTypeEn_Pasteurized    0.016246
14       CategoryTypeEn_Firm Cheese    0.015466
34           RindTypeEn_Washed Rind    0.014874
17  CategoryTypeEn_Semi-soft Cheese    0.014525
24                   MilkTypeEn_Ewe    0.013030
16       CategoryTypeEn_Hard Cheese    0.012171
33               RindTypeEn_No Rind    0.011621
29     MilkTreatmen

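Moisture content is by far the most important feature for the model (importance of about 0.60), which suggests a strong relationship between moisture and fat level. No other feature exceeds about 0.03.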
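Because one-hot encoding spreads each categorical feature across several columns, it can help to sum the importances back to the original feature before comparing them. The sketch below shows one way to do that; the importance values are made up for illustration (they are not the model's output), and `source_column` is a hypothetical helper.

```python
import pandas as pd

# Original categorical columns (a subset, for illustration)
categorical_features = ["MilkTypeEn", "RindTypeEn"]

# Made-up importances keyed by encoded column name
importances = pd.Series({
    "MoisturePercent": 0.60,
    "MilkTypeEn_Goat": 0.10,
    "MilkTypeEn_Cow": 0.05,
    "RindTypeEn_Bloomy Rind": 0.15,
    "RindTypeEn_No Rind": 0.05,
    "Organic": 0.05,
})

def source_column(encoded_name):
    """Map an encoded column name back to the feature it came from."""
    for col in categorical_features:
        if encoded_name.startswith(col + "_"):
            return col
    return encoded_name  # numeric and binary features are unchanged

# Group by the original feature and add up the per-category importances
grouped = importances.groupby(source_column).sum().sort_values(ascending=False)
print(grouped)
```

Matching on known column prefixes avoids mis-splitting category values that themselves contain underscores.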

### Let's tune our hyperparameters

In [12]:
# Redefine the model pipeline with RandomForestRegressor
model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('regressor', RandomForestRegressor())])

# Define the parameter grid for RandomForestRegressor
param_grid = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__max_depth': [None, 10, 20, 30],
    'regressor__min_samples_split': [2, 5, 10],
    'regressor__min_samples_leaf': [1, 2, 4]
}

# Use GridSearchCV to find the best parameters for MAE and R-squared
grid_search_mae = GridSearchCV(model_pipeline, param_grid, cv=5, scoring=['neg_mean_absolute_error', 'r2'], refit=False)
grid_search_mae.fit(cheese_df[features], cheese_df[target])

# Use GridSearchCV to find the best parameters for MSE and R-squared
grid_search_mse = GridSearchCV(model_pipeline, param_grid, cv=5, scoring=['neg_mean_squared_error', 'r2'], refit=False)
grid_search_mse.fit(cheese_df[features], cheese_df[target])

# Extracting best parameters for MAE and R-squared
best_params_mae = grid_search_mae.cv_results_['params'][np.argmax(grid_search_mae.cv_results_['mean_test_neg_mean_absolute_error'])]
print(f"Best parameters MAE: {best_params_mae}")

# Extracting best parameters for MSE and R-squared
best_params_mse = grid_search_mse.cv_results_['params'][np.argmax(grid_search_mse.cv_results_['mean_test_neg_mean_squared_error'])]
print(f"Best parameters MSE: {best_params_mse}")

# Extracting grid search results for MAE and R-squared
results_mae = pd.DataFrame(grid_search_mae.cv_results_)
results_mae['param_regressor__max_depth'] = results_mae['param_regressor__max_depth'].astype(str)

# Extracting grid search results for MSE and R-squared
results_mse = pd.DataFrame(grid_search_mse.cv_results_)
results_mse['param_regressor__max_depth'] = results_mse['param_regressor__max_depth'].astype(str)


# Plotting the grid search results for MAE and R-squared
chart_mae = alt.Chart(results_mae).mark_point().encode(
    x='param_regressor__n_estimators',
    y='mean_test_neg_mean_absolute_error',
    color='param_regressor__max_depth:N',
    tooltip=['param_regressor__n_estimators', 'mean_test_neg_mean_absolute_error', 'mean_test_r2']
).properties(
    title='Figure 5: Grid Search Results for MAE and R2'
).interactive()

# Plotting the grid search results for MSE and R-squared
chart_mse = alt.Chart(results_mse).mark_point().encode(
    x='param_regressor__n_estimators',
    y='mean_test_neg_mean_squared_error',
    color='param_regressor__max_depth:N',
    tooltip=['param_regressor__n_estimators', 'mean_test_neg_mean_squared_error', 'mean_test_r2']
).properties(
    title='Figure 6: Grid Search Results for MSE and R2'
).interactive()

# Displaying the charts
chart_mae | chart_mse


Best parameters MAE: {'regressor__max_depth': None, 'regressor__min_samples_leaf': 1, 'regressor__min_samples_split': 2, 'regressor__n_estimators': 100}
Best parameters MSE: {'regressor__max_depth': 20, 'regressor__min_samples_leaf': 1, 'regressor__min_samples_split': 10, 'regressor__n_estimators': 50}
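Since both searches explore the same grid, a single `GridSearchCV` with all three scorers would likely give the same information at half the fitting cost. The following is a minimal sketch on synthetic data with a deliberately tiny grid; the scorer aliases `mae`, `mse`, and `r2` are names chosen here for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic features and a noisy synthetic 0/1 target
rng = np.random.default_rng(2)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=60) > 0).astype(int)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 20], "min_samples_leaf": [1, 4]},
    cv=3,
    scoring={
        "mae": "neg_mean_absolute_error",
        "mse": "neg_mean_squared_error",
        "r2": "r2",
    },
    refit=False,  # only the scores are needed here
)
search.fit(X_toy, y_toy)

results = search.cv_results_
# rank_test_<name> is 1 for the best candidate under that metric
best_for_mae = results["params"][np.argmin(results["rank_test_mae"])]
best_for_mse = results["params"][np.argmin(results["rank_test_mse"])]

print("Best parameters MAE:", best_for_mae)
print("Best parameters MSE:", best_for_mse)
```

Fixing `random_state` on the regressor inside the search also keeps the ranking stable between runs, which matters when candidates score very close to one another.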


### Let's evaluate these hyperparameters on the test data set

In [13]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(cheese_df[features], cheese_df[target], test_size=0.3, random_state=123)

# Fit the model with the best MAE parameters to the training data
model_pipeline.set_params(**best_params_mae)
model_pipeline.fit(X_train, y_train)

# Predict MAE on the test set
y_pred_mae = model_pipeline.predict(X_test)

# Calculate MAE based evaluation metrics
test_mae_mae = mean_absolute_error(y_test, y_pred_mae)
test_mse_mae = mean_squared_error(y_test, y_pred_mae)
test_r2_mae = r2_score(y_test, y_pred_mae)

print(f"Test MAE: {test_mae_mae} using MAE best params")
print(f"Test MSE: {test_mse_mae} using MAE best params")
print(f"Test R2: {test_r2_mae} using MAE best params")

# Fit the model with the best parameters for MSE to the training data
model_pipeline.set_params(**best_params_mse)
model_pipeline.fit(X_train, y_train)

# Predict on the test set using best MSE parameters
y_pred_mse = model_pipeline.predict(X_test)

# Calculate evaluation metrics for MSE
test_mae_mse = mean_absolute_error(y_test, y_pred_mse)
test_mse_mse = mean_squared_error(y_test, y_pred_mse)
test_r2_mse = r2_score(y_test, y_pred_mse)

print(f"Test MAE: {test_mae_mse} using MSE best params")
print(f"Test MSE: {test_mse_mse} using MSE best params")
print(f"Test R2: {test_r2_mse} using MSE best params")

# Define the data for visualization
plot_data = pd.DataFrame({
    'Metric': ['MAE', 'MSE', 'R-squared'],
    'Using MAE Best Params': [test_mae_mae, test_mse_mae, test_r2_mae],
    'Using MSE Best Params': [test_mae_mse, test_mse_mse, test_r2_mse]
})

# Melt the data for better visualization
data_melted = plot_data.melt('Metric', var_name='Parameter', value_name='Value')

# Create the Altair chart
hyper_chart = alt.Chart(data_melted).mark_bar().encode(
    x='Metric',
    y='Value',
    color='Parameter'
).properties(
    title='Figure 7: Comparison of Model Performance Metrics'
)

# Display the chart
hyper_chart

Test MAE: 0.19783583026036 using MAE best params
Test MSE: 0.11285889854391656 using MAE best params
Test R2: 0.5088542363428856 using MAE best params
Test MAE: 0.21208238931479534 using MSE best params
Test MSE: 0.11097653614163598 using MSE best params
Test R2: 0.5170460079397683 using MSE best params


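The two tuned models perform similarly on the test set. The parameters tuned for MSE give a slightly lower test MSE (0.111 vs. 0.113) and a slightly higher R2 (0.517 vs. 0.509), although their MAE is higher (0.212 vs. 0.198). We will carry the MSE-tuned model forward.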

### Let's compare against other models to ensure we are on the right track

In [14]:
# Define a function to evaluate models
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mae, mse, r2

# Initialize models
models = {
    'RandomForest': RandomForestRegressor(n_estimators=100, random_state=0),
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(),
    'SVR': SVR()
}

# Store results in a list
model_results = []

# Evaluate each model
for name, model in models.items():
    model_pipeline.set_params(regressor=model)
    mae, mse, r2 = evaluate_model(model_pipeline, X_train, X_test, y_train, y_test)
    model_results.append({'Model': name, 'MAE': mae, 'MSE': mse, 'R2': r2})
    
# Create a dataframe from the results
model_results_df = pd.DataFrame(model_results)

# Melt the dataframe for visualization
model_melted_df = pd.melt(model_results_df, id_vars=['Model'], value_vars=['MAE', 'MSE', 'R2'], var_name='Metric', value_name='Value')

# Create the Altair chart
model_chart = alt.Chart(model_melted_df).mark_bar().encode(
    x=alt.X('Model', sort=list(models.keys())),
    y='Value',
    color='Metric',
    column='Metric'
).properties(
    title='Figure 8: Model Performance Metrics Comparison'
).interactive()

# Show the chart
model_chart

#### Based on these results, the RandomForest model appears to perform the best in terms of minimizing both MAE and MSE, and maximizing the R2 score.

Regression is suitable for predicting the fat level in this dataset because:

Numeric Target Variable: The dataset records fat level as two categories, lower fat and higher fat, which we encode as 0 and 1. Treating this encoding as a number lets a regression model output a score between 0 and 1 that can be read as how strongly a cheese leans toward the higher-fat category.

Numerical Features: The dataset contains numerical features such as MoisturePercent, which are likely to have a linear or nonlinear relationship with the fat level. Regression models can capture these relationships and make predictions based on them.

Interpretability: Linear regression models provide a coefficient for each feature, and tree ensembles such as Random Forest provide feature importances. Both can be inspected to understand which factors influence the fat level in cheese.

Model Complexity: The dataset may have nonlinear relationships between features and the target variable. Ensemble methods like Random Forest Regressor can capture complex patterns in the data, and averaging many trees reduces the risk of overfitting compared with a single decision tree, making them suitable for this task.

Performance: Based on the evaluation metrics such as MAE, MSE, and R-squared, the Random Forest Regressor has shown promising performance in predicting the fat level. It has lower MAE and MSE compared to other models, indicating better predictive accuracy.

Overall, considering the nature of the target variable, the types of features available, interpretability requirements, and model performance, regression appears to be a suitable choice for predicting the fat level in this dataset.

### Baseline model and its scores:

In [15]:
# Calculate mean fat level
mean_fat_level = cheese_df['FatLevelNum'].mean()

# Create predictions based on the mean fat level
baseline_predictions = np.full_like(y_test, fill_value=mean_fat_level)

# Evaluate baseline model
baseline_mae = mean_absolute_error(y_test, baseline_predictions)
baseline_mse = mean_squared_error(y_test, baseline_predictions)
baseline_r2 = r2_score(y_test, baseline_predictions)

print("Baseline Model - MAE:", baseline_mae)
print("Baseline Model - MSE:", baseline_mse)
print("Baseline Model - R2:", baseline_r2)



Baseline Model - MAE: 0.35782747603833864
Baseline Model - MSE: 0.35782747603833864
Baseline Model - R2: -0.5572139303482586
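A detail worth verifying in the baseline: `np.full_like` inherits the dtype of the array it copies, so if `y_test` holds integer labels, a fractional mean is truncated to 0. That would make every baseline prediction 0, which appears consistent with the identical MAE and MSE reported above. The sketch below demonstrates the behaviour on made-up labels and shows two float-preserving alternatives, including scikit-learn's `DummyRegressor`; all arrays here are illustrative assumptions.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Made-up integer labels (0 = lower fat, 1 = higher fat)
y_train_toy = np.array([0, 0, 1, 0, 1, 0, 0, 1])   # mean = 0.375
y_test_toy = np.array([0, 1, 0, 1])

mean_level = y_train_toy.mean()

# full_like copies the integer dtype of y_test_toy, so 0.375 is cast to 0
truncated = np.full_like(y_test_toy, fill_value=mean_level)

# np.full infers a float dtype from the fill value and keeps 0.375
as_float = np.full(len(y_test_toy), mean_level)

# DummyRegressor learns the training mean and ignores the features
dummy = DummyRegressor(strategy="mean")
dummy.fit(np.zeros((len(y_train_toy), 1)), y_train_toy)
dummy_pred = dummy.predict(np.zeros((len(y_test_toy), 1)))

print(truncated)   # integer array of zeros
print(as_float)    # float array holding the mean
print(dummy_pred)  # same values as as_float
```

Computing the mean from the training labels only, as the dummy model does, also keeps the baseline consistent with the Golden Rule.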


### Tuned model results:

In [16]:
print(f"Test MAE: {test_mae_mse} using MSE best params")
print(f"Test MSE: {test_mse_mse} using MSE best params")
print(f"Test R2: {test_r2_mse} using MSE best params")

Test MAE: 0.21208238931479534 using MSE best params
Test MSE: 0.11097653614163598 using MSE best params
Test R2: 0.5170460079397683 using MSE best params


The tuned model significantly outperforms the baseline.

MAE (Mean Absolute Error): The average absolute difference between the predicted and actual fat levels. Lower MAE indicates better performance. The tuned model has a lower MAE (0.212) than the baseline model (0.358), indicating that it makes more accurate predictions.

MSE (Mean Squared Error): The average of the squared errors between the predicted and actual fat levels. Again, lower MSE is better. The tuned model has a lower MSE (0.111) than the baseline model (also 0.358), indicating better accuracy.

R-squared: Represents the proportion of the variance in the target variable (fat level) that is predictable from the independent variables (features). Its maximum is 1, with higher values indicating better predictive performance, and it can be negative when predictions fit the test data worse than a constant equal to the mean of the test targets. The tuned model has an R-squared value of 0.517, which is higher than the baseline model's R-squared value of -0.557. The baseline's negative value indicates that it captures no meaningful information about fat levels, whereas the tuned model explains roughly half of the variance in fat levels.
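The behaviour of R-squared described above can be reproduced by hand from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses five made-up labels; `r2_by_hand` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets with mean 0.4
y_true = np.array([0, 0, 1, 0, 1])

def r2_by_hand(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_hat) ** 2)        # error left after prediction
    ss_tot = np.sum((y - y.mean()) ** 2)     # error of predicting the mean
    return 1 - ss_res / ss_tot

mean_pred = np.full(len(y_true), y_true.mean())  # always predict 0.4
zero_pred = np.zeros(len(y_true))                # always predict 0

# Predicting the mean of the evaluated targets gives R^2 = 0
print(r2_by_hand(y_true, mean_pred), r2_score(y_true, mean_pred))

# Predicting 0 everywhere: SS_res = 2, SS_tot = 1.2, so R^2 = 1 - 2/1.2 < 0
print(r2_by_hand(y_true, zero_pred), r2_score(y_true, zero_pred))
```

A constant prediction other than the mean of the evaluated targets always yields a negative score, which is why a negative R-squared signals a model doing worse than the simplest constant guess.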

# References:

1. Wikipedia, "Canadian cheese": https://en.wikipedia.org/wiki/Canadian_cheese#:~:text=7%20Further%20reading-,History,Norwich%2C%20Ontario%2C%20in%201864.
2. scikit-learn, `sklearn.model_selection.KFold` documentation: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
    
