# Predicting Low Fat Cheese
#### by Patrick Dann


## Introduction

In this project I will predict the fat content of cheese based on its properties. This is a classification problem, since each cheese is assigned to one of two categories: lower fat or higher fat.

Predicting the fat content of cheese is desirable because fat content can matter in cooking, where it affects how the cheese behaves in a dish. Cheese is also a source of saturated fat, which is linked to the total fat content of the cheese and is seen as undesirable by many people. Additionally, the fat content greatly affects the calorie content of the cheese. Therefore, fat content should be considered when manufacturing a cheese, and knowing which properties contribute to a low-fat cheese is valuable.

We will look for cheeses with a low fat content and see which properties are predictive of them.

## Exploratory Data Analysis

In [1]:
# Import libraries needed 

import altair as alt
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import make_scorer, precision_score, recall_score, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler,
    OrdinalEncoder,)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.svm import SVC, SVR
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

I will read in the cheese data and split it into training and test sets so that the golden rule is not violated (the test set must not influence model training).

In [11]:
# Read in the cheese data
cheese_df=pd.read_csv(r"C:\Users\ichir\Documents\final-assignment\cheese_data.csv")
cheese_df

Unnamed: 0,CheeseId,ManufacturerProvCode,ManufacturingTypeEn,MoisturePercent,FlavourEn,CharacteristicsEn,Organic,CategoryTypeEn,MilkTypeEn,MilkTreatmentTypeEn,RindTypeEn,CheeseName,FatLevel
0,228,NB,Farmstead,47.0,"Sharp, lactic",Uncooked,0,Firm Cheese,Ewe,Raw Milk,Washed Rind,Sieur de Duplessis (Le),lower fat
1,242,NB,Farmstead,47.9,"Sharp, lactic, lightly caramelized",Uncooked,0,Semi-soft Cheese,Cow,Raw Milk,Washed Rind,Tomme Le Champ Doré,lower fat
2,301,ON,Industrial,54.0,"Mild, tangy, and fruity","Pressed and cooked cheese, pasta filata, inter...",0,Firm Cheese,Cow,Pasteurized,,Provolone Sette Fette (Tre-Stelle),lower fat
3,303,NB,Farmstead,47.0,Sharp with fruity notes and a hint of wild honey,,0,Veined Cheeses,Cow,Raw Milk,,Geai Bleu (Le),lower fat
4,319,NB,Farmstead,49.4,Softer taste,,1,Semi-soft Cheese,Cow,Raw Milk,Washed Rind,Gamin (Le),lower fat
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1037,2387,NS,Farmstead,37.0,"Dill, Caraway, Chili Pepper, Cumin, Sage, Chiv...",Fresh curds through a variety of added Organic...,1,Hard Cheese,Cow,Pasteurized,,Knoydart,higher fat
1038,2388,AB,Industrial,46.0,Mild and Deep Flavor,Low in Sodium and Fat,0,Fresh Cheese,Cow,Pasteurized,,FRESK-O,lower fat
1039,2389,NS,Artisan,40.0,Grassy tang and restrained saltiness that refl...,,0,Veined Cheeses,Ewe,Thermised,,Electric Blue,higher fat
1040,2390,NS,Artisan,34.0,Sweet and tangy flavours combine with hoppy no...,,0,Semi-soft Cheese,Ewe,Thermised,Washed Rind,Hip Hop,higher fat


In [12]:
# Now we will make the training and test sets 
train_df, test_df = train_test_split(cheese_df, test_size=0.2, random_state=77)
train_df.head()

Unnamed: 0,CheeseId,ManufacturerProvCode,ManufacturingTypeEn,MoisturePercent,FlavourEn,CharacteristicsEn,Organic,CategoryTypeEn,MilkTypeEn,MilkTreatmentTypeEn,RindTypeEn,CheeseName,FatLevel
110,940,ON,Industrial,52.0,Tangy,Golden yellow,0,Semi-soft Cheese,Cow,Pasteurized,,Vaquinha (Portuguese),lower fat
762,1873,ON,Industrial,40.0,Slightly more salty taste and firmer texture t...,,0,Firm Cheese,Cow,Pasteurized,,Gorgonzola (Castello),higher fat
898,2057,QC,Artisan,48.0,,,0,Semi-soft Cheese,Cow,Pasteurized,,Tête à Papineau,lower fat
260,1280,QC,Industrial,55.0,"Creamy flavor, hazelnut and flowery hints","Fully ripened, washed rind cheese",0,Soft Cheese,Cow,Raw Milk,Washed Rind,Petit Rubis (Le),lower fat
223,1229,QC,Artisan,60.0,"Available plain or seasoned with chives, prove...","Creamy and white cheese, ball-shaped",0,Fresh Cheese,Goat,Pasteurized,No Rind,Petites Soeurs (Les),lower fat


I will separate the features from the target and select the ones required for the analysis.
The target `y` will be `FatLevel`, and I will use `ManufacturingTypeEn`, `MoisturePercent`, `Organic`, `CategoryTypeEn`, `MilkTypeEn`, `MilkTreatmentTypeEn` and `RindTypeEn` as the features `X`, since these are relevant to the manufacturing process and cheese properties.

In [13]:
# Separate the features (X) from the target (y); lower fat is the positive class (1)
drop_cols = ['CheeseId', 'ManufacturerProvCode', 'FlavourEn', 'CharacteristicsEn', 'CheeseName', 'FatLevel']

X_train = train_df.drop(columns=drop_cols)
y_train = train_df['FatLevel'].map({'lower fat': 1, 'higher fat': 0}).astype(int)

X_test = test_df.drop(columns=drop_cols)
y_test = test_df['FatLevel'].map({'lower fat': 1, 'higher fat': 0}).astype(int)

X_train

Unnamed: 0,ManufacturingTypeEn,MoisturePercent,Organic,CategoryTypeEn,MilkTypeEn,MilkTreatmentTypeEn,RindTypeEn
110,Industrial,52.0,0,Semi-soft Cheese,Cow,Pasteurized,
762,Industrial,40.0,0,Firm Cheese,Cow,Pasteurized,
898,Artisan,48.0,0,Semi-soft Cheese,Cow,Pasteurized,
260,Industrial,55.0,0,Soft Cheese,Cow,Raw Milk,Washed Rind
223,Artisan,60.0,0,Fresh Cheese,Goat,Pasteurized,No Rind
...,...,...,...,...,...,...,...
736,Artisan,55.0,1,Firm Cheese,Cow,Pasteurized,
927,Industrial,40.0,0,,Cow,,No Rind
235,Artisan,50.0,0,Soft Cheese,Cow,Pasteurized,Bloomy Rind
607,Industrial,41.0,0,Firm Cheese,Cow,Pasteurized,No Rind


Now I will look at the features and their dtypes.

In [14]:
# take a look at the dtypes of the train set 
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 833 entries, 110 to 727
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ManufacturingTypeEn  833 non-null    object 
 1   MoisturePercent      823 non-null    float64
 2   Organic              833 non-null    int64  
 3   CategoryTypeEn       814 non-null    object 
 4   MilkTypeEn           832 non-null    object 
 5   MilkTreatmentTypeEn  779 non-null    object 
 6   RindTypeEn           583 non-null    object 
dtypes: float64(1), int64(1), object(5)
memory usage: 52.1+ KB


We can see that there are 833 entries and that `MoisturePercent`, `CategoryTypeEn`, `MilkTypeEn`, `MilkTreatmentTypeEn`, and `RindTypeEn` all have null values.

Additionally, we can see that there are 5 object-typed (categorical) features and 2 numeric ones.
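
The same counts can also be computed directly rather than read off the `info()` printout. The snippet below is a minimal sketch on a small made-up frame; `toy_df` and its values are hypothetical and only illustrate the calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only
toy_df = pd.DataFrame({
    "MoisturePercent": [52.0, np.nan, 48.0, 55.0],
    "Organic": [0, 0, 1, 0],
    "RindTypeEn": [np.nan, "Washed Rind", np.nan, "No Rind"],
})

# Number of missing values per column
null_counts = toy_df.isna().sum()

# Split the columns by dtype: non-numeric columns are treated as categorical
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print(null_counts)
print(categorical_cols, numeric_cols)
```

On the real data, the same two calls would be applied to `X_train`.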

Now I will have a deeper look at the features.

In [15]:
# looking at the ordinal column 
X_train['CategoryTypeEn'].unique()

array(['Semi-soft Cheese', 'Firm Cheese', 'Soft Cheese', 'Fresh Cheese',
       nan, 'Hard Cheese', 'Veined Cheeses'], dtype=object)

In [16]:
# describe the categorical features 
X_train.describe(include='object')

Now I'll look at the summary statistics of the numerical columns.

In [17]:
# describe the numerical features 
X_train.describe()

Unnamed: 0,MoisturePercent,Organic
count,823.0,833.0
mean,46.955043,0.094838
std,9.557279,0.293167
min,12.0,0.0
25%,40.0,0.0
50%,46.0,0.0
75%,52.0,0.0
max,88.0,1.0


Let's look at the `Organic` column, since it appears to be binary.

In [18]:
X_train['Organic'].unique()

array([0, 1], dtype=int64)

The `Organic` column is indeed binary rather than continuous. That leaves us with one numerical column, `MoisturePercent`.

There are null values in the columns `MoisturePercent`, `CategoryTypeEn`, `MilkTypeEn`, `MilkTreatmentTypeEn`, and `RindTypeEn`, so these will have to be imputed.
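
As a minimal sketch of what imputation does, the toy example below fills a numeric column with its median and a categorical column with its most frequent value. The values are made up, and the two strategies shown are simply two common `SimpleImputer` options used here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy columns; the values are illustrative only
moisture = pd.DataFrame({"MoisturePercent": [40.0, 50.0, np.nan, 60.0]})
rind = pd.DataFrame(
    {"RindTypeEn": ["No Rind", np.nan, "No Rind", "Washed Rind"]}, dtype=object
)

# Numeric column: replace missing values with the column median
moisture_filled = SimpleImputer(strategy="median").fit_transform(moisture)

# Categorical column: replace missing values with the most frequent category
rind_filled = SimpleImputer(strategy="most_frequent").fit_transform(rind)

print(moisture_filled.ravel())
print(rind_filled.ravel())
```

In a real analysis the imputer should be fitted on the training set only, which is what placing it inside a pipeline takes care of.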

I want to visualize some of the categorical features and see what kind of distributions I am dealing with.

In [19]:
# Visualize the MilkTypeEn distribution (training data only, to respect the golden rule)
MilkType_plot = alt.Chart(train_df).mark_bar().encode(
    alt.X('MilkTypeEn', title='Milk Type', sort='y'),
    alt.Y('count()', title='Number of cheeses', stack=None),
    alt.Color('FatLevel', title='Fat Level'),
).properties(title='Milk Type Distribution').facet('FatLevel')

MilkType_plot

In [20]:
# Visualize the MilkTreatmentTypeEn distribution (training data only, to respect the golden rule)
MilkTreatmentType_plot = alt.Chart(train_df).mark_bar().encode(
    alt.X('MilkTreatmentTypeEn', title='Milk Treatment Type', sort='y'),
    alt.Y('count()', title='Number of cheeses', stack=None),
    alt.Color('FatLevel', title='Fat Level'),
).properties(title='Milk Treatment Type Distribution').facet('FatLevel')

MilkTreatmentType_plot

In [21]:
# Visualize the ManufacturingTypeEn distribution (training data only, to respect the golden rule)
ManufacturingType_plot = alt.Chart(train_df).mark_bar().encode(
    alt.X('ManufacturingTypeEn', title='Manufacturing Type', sort='y'),
    alt.Y('count()', title='Number of cheeses', stack=None),
    alt.Color('FatLevel', title='Fat Level'),
).properties(title='Manufacturing Type Distribution').facet('FatLevel')

ManufacturingType_plot

Now let's see if there is any relationship between moisture percent and fat level.

In [22]:
# Plot MoisturePercent against FatLevel (training data only, to respect the golden rule)
MoisturePercent_plot = alt.Chart(train_df).mark_boxplot().encode(
    alt.X('MoisturePercent', title='Moisture Percent'),
    alt.Y('FatLevel', title='Fat Level'),
).properties(title='Moisture Percent Relationship with Fat Level')
MoisturePercent_plot

It seems that lower-fat cheeses generally have a higher moisture percent. This makes sense, since a higher moisture percent means that more of the cheese's weight is water, leaving less for fat.
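
The visual impression from the boxplot can be checked numerically by comparing group medians. The following is a minimal sketch on made-up values; `toy_df` is a hypothetical stand-in for the training frame.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative only
toy_df = pd.DataFrame({
    "FatLevel": ["lower fat", "lower fat", "higher fat", "higher fat", "lower fat"],
    "MoisturePercent": [52.0, 60.0, 40.0, 36.0, 55.0],
})

# Median moisture percent within each fat level
moisture_by_fat = toy_df.groupby("FatLevel")["MoisturePercent"].median()
print(moisture_by_fat)
```

A clearly higher median in the lower-fat group would support what the boxplot suggests.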

## Methods and Results

I will first make a `DummyClassifier` to use as a baseline to compare the final model to.

In [23]:
# building and scoring the DummyClassifier
dummy_model = DummyClassifier(strategy = 'prior')

scores = cross_validate(dummy_model, X_train, y_train, cv=5, return_train_score=True)

dummy_scores = pd.DataFrame(scores)

dummy_scores

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.000993,0.001,0.658683,0.654655
1,0.001998,0.0,0.652695,0.656156
2,0.001001,0.001001,0.652695,0.656156
3,0.000999,0.0,0.656627,0.655172
4,0.0,0.001,0.656627,0.655172


I want to see the mean scores so I can compare them to my model.

In [24]:
dummy_scores.mean()

fit_time       0.000998
score_time     0.000600
test_score     0.655465
train_score    0.655462
dtype: float64

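The mean test score is essentially the same as the mean train score (about 0.655). This is expected: with `strategy='prior'` the `DummyClassifier` ignores the features and always predicts the most frequent class, so its accuracy simply reflects the share of lower-fat cheeses in the data. The baseline is underfitting by design.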
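
A `DummyClassifier` with `strategy='prior'` always predicts the most frequent class, so its accuracy equals that class's share of the labels. The sketch below shows this on a made-up target; `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy target: 13 of 20 samples belong to class 1
y_toy = np.array([1] * 13 + [0] * 7)
X_toy = np.zeros((20, 1))  # the features are ignored by the dummy model

dummy = DummyClassifier(strategy="prior").fit(X_toy, y_toy)

# Accuracy of always predicting the majority class equals its share of the data
baseline_accuracy = dummy.score(X_toy, y_toy)
majority_share = y_toy.mean()
print(baseline_accuracy, majority_share)
```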

I have already identified the numerical, binary, and categorical features in `X_train`, so now I will define them in lists to build the pipelines.

In [25]:
# Defining the numerical, binary, ordinal and categorical features 
numeric_feats = ['MoisturePercent' ]
binary_feats = ['Organic']
ordinal_feats = []
categorical_feats = ['ManufacturingTypeEn', 'MilkTreatmentTypeEn', 'MilkTypeEn','CategoryTypeEn', 'RindTypeEn'] 



Now I will make the transformers for the column transformer. Since there are missing values, I will use `SimpleImputer`, and I will use `OneHotEncoder` for the binary and categorical features so they can be used in the model.

In [26]:
# Making the numerical transformer: median imputation, then standardization
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())])

# Making the binary transformer: constant imputation (0 for numeric data), then a single 0/1 column
binary_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="constant")), ("onehot", OneHotEncoder(drop="if_binary", dtype=int))])

# Making the categorical transformer: most-frequent imputation, then one-hot encoding
# (fill_value only applies to strategy="constant", so it is omitted here)
categorical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="most_frequent")), ("onehot", OneHotEncoder(handle_unknown="ignore"))])

Now I will make the Column Transformer with the transformers and their designated features 

In [27]:
# Making the ColumnTransformer 
col_transformer = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_feats),
        ("cat", categorical_transformer, categorical_feats),
        ("binary", binary_transformer, binary_feats)])

Now I will make the pipeline. I will use `LogisticRegression`, since we are looking at a classification problem.

In [28]:
# Making the Pipeline
lr_pipe = Pipeline(
    steps=[("preprocessor", col_transformer), ("reg", LogisticRegression(class_weight="balanced", max_iter=1000))])

In [29]:
# Fitting the pipeline on the training set
lr_pipe.fit(X_train, y_train)

In [30]:
# Finding the accuracy, precision and recall of the model
lr_scores = pd.DataFrame(cross_validate(lr_pipe, X_train, y_train, cv=5, return_train_score=True, scoring=['accuracy', 'precision', 'recall']))
lr_scores

Unnamed: 0,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall
0,0.022999,0.006001,0.790419,0.77027,0.886598,0.867532,0.781818,0.766055
1,0.022,0.008005,0.718563,0.780781,0.878049,0.870229,0.66055,0.782609
2,0.023026,0.009976,0.808383,0.762763,0.840708,0.872,0.87156,0.748284
3,0.016267,0.005,0.740964,0.767616,0.875,0.86911,0.706422,0.759725
4,0.016002,0.004998,0.73494,0.778111,0.849462,0.871465,0.724771,0.775744


In [31]:
lr_scores.mean()

fit_time           0.020059
score_time         0.006796
test_accuracy      0.758654
train_accuracy     0.771908
test_precision     0.865963
train_precision    0.870067
test_recall        0.749024
train_recall       0.766483
dtype: float64

Now I will make an `SVC` model and see how it compares to the `LogisticRegression` model.

In [32]:
# Making the SVC model
SVC_pipe = Pipeline(
    steps=[("preprocessor", col_transformer), ("svc", SVC(class_weight="balanced"))])

In [33]:
# Fitting the model on the training data
SVC_pipe.fit(X_train, y_train)

In [34]:
# Scoring the SVC model
SVC_scores = pd.DataFrame(cross_validate(SVC_pipe, X_train, y_train, cv=5, return_train_score=True, scoring=['accuracy', 'precision', 'recall']))
SVC_scores

Unnamed: 0,fit_time,score_time,test_accuracy,train_accuracy,test_precision,train_precision,test_recall,train_recall
0,0.024001,0.009116,0.790419,0.803303,0.912088,0.915531,0.754545,0.770642
1,0.024997,0.023,0.748503,0.822823,0.91358,0.939394,0.678899,0.78032
2,0.025001,0.008003,0.832335,0.810811,0.885714,0.923706,0.853211,0.775744
3,0.023001,0.007002,0.76506,0.806597,0.906977,0.935028,0.715596,0.757437
4,0.016997,0.007003,0.704819,0.827586,0.833333,0.942308,0.688073,0.784897


In [35]:
SVC_scores.mean()

fit_time           0.022800
score_time         0.010825
test_accuracy      0.768227
train_accuracy     0.814224
test_precision     0.890339
train_precision    0.931193
test_recall        0.738065
train_recall       0.773808
dtype: float64

In [36]:
# comparing the 2 models 
print('SVC scores')
print(SVC_scores.mean())
print('LogisticRegression scores')
print(lr_scores.mean())

SVC scores
fit_time           0.022800
score_time         0.010825
test_accuracy      0.768227
train_accuracy     0.814224
test_precision     0.890339
train_precision    0.931193
test_recall        0.738065
train_recall       0.773808
dtype: float64
LogisticRegression scores
fit_time           0.020059
score_time         0.006796
test_accuracy      0.758654
train_accuracy     0.771908
test_precision     0.865963
train_precision    0.870067
test_recall        0.749024
train_recall       0.766483
dtype: float64


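The `SVC` model performs slightly better than the `LogisticRegression` model on mean test accuracy (0.768 vs. 0.759) and precision (0.890 vs. 0.866), whereas `LogisticRegression` performs slightly better on recall (0.749 vs. 0.738). The `SVC` model also shows a larger gap between its train and test scores.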

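In this case I treat recall as the more important metric. Lower fat is the positive class (1), so recall is the share of truly lower-fat cheeses that the model identifies, and a false negative is a lower-fat cheese that the model misses. The opposite error, a higher-fat cheese predicted as lower fat, is a false positive and is tracked by precision.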
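
To make the two error types concrete, here is a minimal sketch on made-up labels, using the same encoding as this notebook (1 = lower fat, 0 = higher fat). The label lists are hypothetical.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical toy labels: 1 = lower fat (positive class), 0 = higher fat
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

# With labels ordered [0, 1], ravel() returns tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# fn: lower-fat cheeses predicted as higher fat (lowers recall)
# fp: higher-fat cheeses predicted as lower fat (lowers precision)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
print(tn, fp, fn, tp, recall, precision)
```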

I will tune the hyperparameters of the `LogisticRegression` estimator to find the best model, but first I have to find which parameters can be tuned. I will tune on recall, since I identified it as the most important metric.
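
When several metrics are passed to `GridSearchCV`, the `refit` argument decides which one ranks the candidates, while the others remain available in `cv_results_`. The sketch below illustrates this on synthetic data from `make_classification`, which is an assumption standing in for the cheese data; the grid is also deliberately smaller than the one used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption for illustration, not the cheese data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring=["recall", "precision", "accuracy"],  # several metrics are recorded
    refit="recall",                               # candidates are ranked by recall
    cv=5,
)
search.fit(X_toy, y_toy)

# best_score_ is the mean cross-validated recall of the winning candidate
best = search.best_index_
print(search.best_params_, search.best_score_)

# The other metrics for the same candidate are stored in cv_results_
print(search.cv_results_["mean_test_precision"][best])
print(search.cv_results_["mean_test_accuracy"][best])
```

Reading the other metrics this way shows what a recall-optimal candidate costs in precision and accuracy.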


In [37]:
# searching the parameters for the LogisticRegression model
LogisticRegression().get_params().keys()

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

In [38]:
param_grid =  {  
        'reg__C': [100, 10, 1.0, 0.1, 0.01],
        'reg__penalty': ['l2'],
        'reg__solver': ['newton-cg', 'lbfgs', 'liblinear']}

scoring={
    'precision_score': make_scorer(precision_score),
    'recall_score': make_scorer(recall_score),
    'accuracy_score': make_scorer(accuracy_score)
}


# Tuning the hyperparameters C, penalty, and solver with GridSearchCV
grid_search = GridSearchCV(lr_pipe, param_grid, cv=5, n_jobs=-1, verbose=3, return_train_score=True, scoring=scoring, refit='recall_score')

# Fitting to the training set
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 15 candidates, totalling 75 fits


In [39]:
grid_search.best_score_

0.7490408673894913

In [40]:
# Finding the best parameters from the grid search 
grid_search.best_params_

{'reg__C': 10, 'reg__penalty': 'l2', 'reg__solver': 'newton-cg'}

In [41]:
# finding the best model
best_model = grid_search.best_estimator_

In [42]:
# Fitting the best model
best_model.fit(X_train, y_train)

In [43]:
# scoring the accuracy of the best model on the test set
best_model.score(X_test, y_test)

0.7464114832535885

In [45]:
# viewing the classification report
print(classification_report(y_test, best_model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.60      0.73      0.66        71
           1       0.85      0.75      0.80       138

    accuracy                           0.75       209
   macro avg       0.73      0.74      0.73       209
weighted avg       0.76      0.75      0.75       209



In [46]:
# comparing to the baseline DummyClassifier
dummy_model.fit(X_train, y_train)

print(classification_report(y_test, dummy_model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        71
           1       0.66      1.00      0.80       138

    accuracy                           0.66       209
   macro avg       0.33      0.50      0.40       209
weighted avg       0.44      0.66      0.53       209




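After hyperparameter tuning, the best model has a test accuracy of 0.75 and a recall of 0.75 for the lower-fat class (macro-averaged recall of 0.74). The baseline `DummyClassifier` reaches a test accuracy of 0.66 by labelling every cheese as lower fat.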
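
The per-class and macro-averaged figures in the report can be tied together with a little arithmetic. The confusion counts below are not printed anywhere in this notebook; they are reconstructed from the rounded classification report and the test accuracy (156 of 209 correct), so treat them as an assumption.

```python
# Confusion counts reconstructed from the classification report (an assumption):
# the first word is the true class, the second the predicted class
high_as_high, high_as_low = 52, 19   # 71 higher-fat cheeses in the test set
low_as_high, low_as_low = 34, 104    # 138 lower-fat cheeses in the test set

total = high_as_high + high_as_low + low_as_high + low_as_low

accuracy = (high_as_high + low_as_low) / total

# Recall per class: correct predictions divided by the number of true members
recall_low = low_as_low / (low_as_high + low_as_low)
recall_high = high_as_high / (high_as_high + high_as_low)

# Macro average: unweighted mean of the per-class recalls
macro_recall = (recall_low + recall_high) / 2

print(round(accuracy, 2), round(recall_low, 2), round(recall_high, 2), round(macro_recall, 2))
```

The macro average weights both classes equally, which is why it sits slightly below the recall of the larger lower-fat class.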

I want to see which features influence the prediction the most, so I have to find all the feature names and match them to their coefficients. I will do this with `.coef_`.

In [47]:
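# Refitting a LogisticRegression on the transformed features to inspect its coefficients
# (note: C=100 is used here, whereas the grid search above selected C=10)
lr_reg=LogisticRegression(C=100, class_weight='balanced', max_iter=1000, solver='newton-cg')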

# Transforming the X_train with the column transformer from the pipeline
X_train_transformed = col_transformer.transform(X_train)

# fitting the best LogisticRegression model 
lr_reg.fit(X_train_transformed, y_train)

# viewing the coefficients
lr_coeffs =lr_reg.coef_
lr_coeffs

array([[ 2.34052182, -0.10277606,  0.11485351, -0.01361841,  0.28023868,
        -0.39494322,  0.11316359,  0.01086915,  0.02630025, -0.03188366,
         2.40951553, -0.59974009, -1.67867551, -1.15959917,  1.02167255,
        -0.16055839, -0.51860168,  1.27447797,  0.0405896 , -0.58124713,
        -0.05620132, -0.27765517, -1.20089644,  0.59673026,  0.88028039,
        -0.12516507]])

In [50]:
# Find the new categorical columns created by the column transformer
new_cols = col_transformer.named_transformers_["cat"].named_steps["onehot"].get_feature_names_out(categorical_feats)

# Make an object with all the features
columns = numeric_feats + list(new_cols) + binary_feats
columns

['MoisturePercent',
 'ManufacturingTypeEn_Artisan',
 'ManufacturingTypeEn_Farmstead',
 'ManufacturingTypeEn_Industrial',
 'MilkTreatmentTypeEn_Pasteurized',
 'MilkTreatmentTypeEn_Raw Milk',
 'MilkTreatmentTypeEn_Thermised',
 'MilkTypeEn_Buffalo Cow',
 'MilkTypeEn_Cow',
 'MilkTypeEn_Cow and Goat',
 'MilkTypeEn_Cow, Goat and Ewe',
 'MilkTypeEn_Ewe',
 'MilkTypeEn_Ewe and Cow',
 'MilkTypeEn_Ewe and Goat',
 'MilkTypeEn_Goat',
 'CategoryTypeEn_Firm Cheese',
 'CategoryTypeEn_Fresh Cheese',
 'CategoryTypeEn_Hard Cheese',
 'CategoryTypeEn_Semi-soft Cheese',
 'CategoryTypeEn_Soft Cheese',
 'CategoryTypeEn_Veined Cheeses',
 'RindTypeEn_Bloomy Rind',
 'RindTypeEn_Brushed Rind',
 'RindTypeEn_No Rind',
 'RindTypeEn_Washed Rind',
 'Organic']

In [51]:
lr_reg.intercept_

array([-0.02314974])

In [52]:
# Predict on the test set
predicted_y = best_model.predict(X_test)

# Find the predicted probabilities
proba_y = best_model.predict_proba(X_test)

# View the probability of low fat for each cheese name in a dataframe
lr_probs = pd.DataFrame({
            "Cheese name":test_df['CheeseName'],
             "true y":y_test, 
             "pred y": predicted_y.tolist(),
             "prob_LowFat": proba_y[:, 1].tolist()})
lr_probs.sort_values(by='prob_LowFat', ascending=False)

Unnamed: 0,Cheese name,true y,pred y,prob_LowFat
93,Burrino (Salerno),1,1,0.999989
776,Fromage Chèvre frais,1,1,0.999838
123,Ricotta (Shepherd Gourmet),1,1,0.999565
932,Paysanne (La),1,1,0.999380
632,Roulé (Le),1,1,0.999190
...,...,...,...,...
177,Blackburn (Le),0,0,0.034655
1041,Super Fresh Cheese Curds,0,0,0.029325
955,Frère Chasseur (Le),0,0,0.026626
143,Blossom's Blue,0,0,0.008703


In [53]:
# Viewing the coefficients of the features in a dataframe
data = {'features': columns, 'coefficients':lr_coeffs[0]}
pd.DataFrame(data).sort_values(by='coefficients', ascending=False)

Unnamed: 0,features,coefficients
10,"MilkTypeEn_Cow, Goat and Ewe",2.409516
0,MoisturePercent,2.340522
17,CategoryTypeEn_Hard Cheese,1.274478
14,MilkTypeEn_Goat,1.021673
24,RindTypeEn_Washed Rind,0.88028
23,RindTypeEn_No Rind,0.59673
4,MilkTreatmentTypeEn_Pasteurized,0.280239
2,ManufacturingTypeEn_Farmstead,0.114854
6,MilkTreatmentTypeEn_Thermised,0.113164
18,CategoryTypeEn_Semi-soft Cheese,0.04059


## Discussion

The best model has a test accuracy of 0.75 and macro-averaged recall, F1 and precision of 0.74, 0.73 and 0.73. This is better than the baseline `DummyClassifier`, which had an accuracy of 0.66 and macro-averaged precision, recall and F1 of 0.33, 0.50 and 0.40.

The main metric I used to score the model was recall. Recall, or sensitivity, is the true positive rate, which in this case is the rate at which we correctly identify a lower-fat cheese as lower fat. I decided that this metric is important because, if we want to manufacture low-fat cheese and use this model to predict the manufacturing properties required, a high recall means that few genuinely low-fat cheeses are missed. The opposite error, mistakenly making a higher-fat cheese that the model labelled as lower fat, is a false positive, which precision rather than recall captures, so precision should be read alongside recall.

Looking at the features and coefficients, we can see that `MilkTypeEn_Cow, Goat and Ewe` and `MoisturePercent` have the largest positive coefficients, so they contribute most to predicting low-fat cheese. This makes sense to me, since a higher moisture percent means that more of the weight is water and thus less is fat. I would also expect milk type to have the greatest impact on fat content, since the majority, if not all, of the fat in the cheese comes from the milk. Additionally, the `MilkTypeEn_Ewe and Cow` feature has the most negative coefficient; in other words, it contributes most to predicting high fat.

Some features that contributed minimally to the prediction of low or high fat content were `MilkTypeEn_Buffalo Cow` and `ManufacturingTypeEn_Industrial`. The buffalo cow milk type not contributing surprises me, since I would expect buffalo milk to have different properties than cow milk and thus to affect the fat content. However, the industrial manufacturing feature does not surprise me, since I would expect industrial processes to be quite adaptable and able to produce many different kinds of high-fat or low-fat cheeses.
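
Because `MoisturePercent` was standardized before fitting, its coefficient refers to a change of one standard deviation rather than one percentage point. The sketch below shows one rough way to read it as an odds ratio, using the coefficient and the training-set standard deviation reported earlier. The per-point figure is only approximate, since the scaler was fitted on imputed values and the coefficients come from the refit with `C=100`.

```python
import math

# Coefficient reported for MoisturePercent in the coefficient table
moisture_coef = 2.340522

# Standard deviation of MoisturePercent in the training set (from describe())
moisture_std = 9.557279

# exp(coefficient) is the multiplicative change in the odds of "lower fat"
# for a one-unit increase in the standardized feature, other features held fixed
odds_ratio_per_std = math.exp(moisture_coef)

# The same effect expressed per percentage point of moisture (approximate)
odds_ratio_per_point = math.exp(moisture_coef / moisture_std)

print(round(odds_ratio_per_std, 1), round(odds_ratio_per_point, 2))
```

Read this way, an increase of one standard deviation in moisture multiplies the modelled odds of a lower-fat label roughly tenfold.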

I am curious about how the `FlavourEn` and `CharacteristicsEn` features would have impacted the model. I think that using these features would have improved the model, since I suspect that the fat content, flavour and characteristics of cheese are closely tied together.

A further question about this dataset is whether one could predict the `FlavourEn` of a cheese from its other features. This would be interesting, since predicting the taste or smell of a cheese would be very desirable. The model could then be used to find features that are predictive of particular taste profiles.

