# ML Final Project - Predicting fat level in Canadian cheese
## Jesse Sallis - Bcom, MIB
### March 2022

Introduction: This notebook explores a Canadian cheese dataset and builds and tests multiple machine learning models to predict the fat level of various cheeses. The analysis consists of importing and cleaning the dataset, developing a baseline model, testing multiple machine learning models, and tuning hyperparameters to create the most accurate model possible.


### Question that will be explored:
* Can Canadian cheese be accurately classified as either low-fat or high-fat based on qualitative and quantitative features?

This is a classification question in which each cheese will be classified as either lower fat or higher fat. The purpose of predicting fat level is health driven: perhaps due to food allergies or dietary restrictions, we want to avoid cheese with a high fat level. Given this, our positive label will be lower fat. Looking over the data, I expect we will be able to develop a proficient model based on trends that are effective predictors of fat content. For example, cheese made from goat milk with a low moisture content tends to have a certain fat level; if strong relationships like this exist within the dataset, our model can identify them and use them to drive its predictions.
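
Because the positive label is the string `lower fat`, scikit-learn's binary metrics (which default to `pos_label=1`) generally need the positive class passed explicitly. The following is a minimal sketch of how that could look; the toy `y_true`/`y_pred` lists and the names `POSITIVE_LABEL` and `lower_fat_f1_scorer` are illustrative assumptions, not values or names from this notebook.

```python
from sklearn.metrics import f1_score, make_scorer, recall_score

# Positive class as stated in the text above
POSITIVE_LABEL = "lower fat"

# Toy labels for illustration only (not taken from the cheese dataset)
y_true = ["lower fat", "lower fat", "higher fat", "lower fat", "higher fat"]
y_pred = ["lower fat", "higher fat", "higher fat", "lower fat", "lower fat"]

# Metrics computed with "lower fat" treated as the positive class
recall_lower = recall_score(y_true, y_pred, pos_label=POSITIVE_LABEL)
f1_lower = f1_score(y_true, y_pred, pos_label=POSITIVE_LABEL)

# A scorer object that could be passed as `scoring=` to cross_validate
# or to a hyperparameter search
lower_fat_f1_scorer = make_scorer(f1_score, pos_label=POSITIVE_LABEL)

print(recall_lower, f1_lower)
```

Passing the scorer rather than the string `"f1"` keeps the choice of positive class explicit wherever models are compared.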

                
### Datasets

For the purpose of this exercise, one dataset will be used. The data can be found at https://data.amerigeoss.org/dataset/3c16cd48-3ac3-453f-8260-6f745181c83b and is released under the licence at https://open.canada.ca/en/open-government-licence-canada.

* **cheese_data.csv**
    * a collection of Canadian cheeses with qualitative and quantitative descriptors (e.g. milk type, moisture content, and manufacturer location)

### Method and Results

*Using the above dataset, the analysis will proceed as follows:*
* Load the dataset as a dataframe
* Split into train/test sets and run an initial diagnosis of data quality (NaNs, missing values, incorrect dtypes)
* Determine the appropriate feature columns and which scaling/imputation steps are needed
* Create a Dummy model for comparison
* Create a pipeline and use hyperparameter tuning to determine the best-scoring model
* Demonstrate the prediction scores of the best-scoring model

In [1]:
#importing required packages, additional packages will be imported as needed
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate

#importing dataset
cheese_df = pd.read_csv('data/cheese_data.csv')
cheese_df.head()

Unnamed: 0,CheeseId,ManufacturerProvCode,ManufacturingTypeEn,MoisturePercent,FlavourEn,CharacteristicsEn,Organic,CategoryTypeEn,MilkTypeEn,MilkTreatmentTypeEn,RindTypeEn,CheeseName,FatLevel
0,228,NB,Farmstead,47.0,"Sharp, lactic",Uncooked,0,Firm Cheese,Ewe,Raw Milk,Washed Rind,Sieur de Duplessis (Le),lower fat
1,242,NB,Farmstead,47.9,"Sharp, lactic, lightly caramelized",Uncooked,0,Semi-soft Cheese,Cow,Raw Milk,Washed Rind,Tomme Le Champ Doré,lower fat
2,301,ON,Industrial,54.0,"Mild, tangy, and fruity","Pressed and cooked cheese, pasta filata, inter...",0,Firm Cheese,Cow,Pasteurized,,Provolone Sette Fette (Tre-Stelle),lower fat
3,303,NB,Farmstead,47.0,Sharp with fruity notes and a hint of wild honey,,0,Veined Cheeses,Cow,Raw Milk,,Geai Bleu (Le),lower fat
4,319,NB,Farmstead,49.4,Softer taste,,1,Semi-soft Cheese,Cow,Raw Milk,Washed Rind,Gamin (Le),lower fat


In [8]:
#splitting train/test (shuffle=True and stratify=None are the defaults, written out explicitly)
train_df, test_df = train_test_split(cheese_df, test_size=0.2, random_state=55, shuffle=True, stratify=None)
train_df.head()

Unnamed: 0,CheeseId,ManufacturerProvCode,ManufacturingTypeEn,MoisturePercent,FlavourEn,CharacteristicsEn,Organic,CategoryTypeEn,MilkTypeEn,MilkTreatmentTypeEn,RindTypeEn,CheeseName,FatLevel
570,1605,QC,Artisan,48.0,Mild,"Camembert type cheese, white bloomy rind that ...",0,Soft Cheese,Cow,Pasteurized,Bloomy Rind,Petit Normand (Le),lower fat
1020,2360,NB,Farmstead,50.0,With a pepper and garlic flavour,Creamy cheese with pepper and garlic,0,Fresh Cheese,Goat,Pasteurized,,Poivroux (Le),lower fat
258,1278,QC,Industrial,55.0,"Creamy taste, exudes intense and expansive aroma","Surface ripened, washed, shiny rind which is s...",0,Soft Cheese,Cow,Pasteurized,Washed Rind,Sir Laurier d'Arthabaska,lower fat
26,662,ON,Industrial,39.0,,,0,Firm Cheese,Cow,Pasteurized,No Rind,Cheddar (Balderson),higher fat
350,1374,QC,Industrial,49.0,Mild,,0,Firm Cheese,Cow,Pasteurized,No Rind,Suisse Grubec léger,lower fat
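
The split above uses `stratify=None`, so the share of each `FatLevel` class in the train and test sets can differ by chance. As a possible alternative (not what this notebook does), passing the target column to `stratify` keeps the class proportions close to equal in both sets. The sketch below uses a small made-up dataframe, `toy_df`, purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 10 "lower fat" and 5 "higher fat" rows
toy_df = pd.DataFrame({
    "MoisturePercent": [40.0 + i for i in range(15)],
    "FatLevel": ["lower fat"] * 10 + ["higher fat"] * 5,
})

# Stratify on the target so both splits keep the 2:1 class ratio
strat_train, strat_test = train_test_split(
    toy_df, test_size=0.2, random_state=55, stratify=toy_df["FatLevel"]
)

# Share of each class in the two splits
train_share = strat_train["FatLevel"].value_counts(normalize=True)
test_share = strat_test["FatLevel"].value_counts(normalize=True)

print(train_share)
print(test_share)
```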


## Exploring our dataset
 
After splitting into train and test sets, we can now look at the characteristics of the training data. Features are a combination of quantitative (e.g. MoisturePercent) and qualitative (e.g. CheeseName) descriptors. Our target will be the column *FatLevel*, which has two classes (lower fat / higher fat) split roughly 66% / 34% in the training set.

The majority of the columns are of dtype object, so a transformation (e.g. one-hot encoding, bag of words) will need to be applied for them to be useful within our model. CheeseId is an identification feature and can be dropped. Three columns have a significant number of NaNs (FlavourEn, CharacteristicsEn, RindTypeEn), all of object dtype. The easiest solution would be to drop these columns; however, that could deprive the model of potentially useful features. `SimpleImputer` with `strategy='most_frequent'` will work for the object columns, and mean imputation will work for the numeric column with missing values (MoisturePercent). RindTypeEn, given that it has only 4 unique values and a high number of NaNs, can be dropped. The Organic column is already in binary form, which is helpful as it needs no further transformation.
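
The preprocessing plan described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's final pipeline: the toy dataframe, the feature groupings, the variable names, and the use of `StandardScaler` are all assumptions, and the free-text columns (FlavourEn, CharacteristicsEn) that would need a bag-of-words step are left out for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows for illustration only; column names follow the cheese dataset
toy_df = pd.DataFrame({
    "CheeseId": [1, 2, 3, 4],
    "MoisturePercent": [47.0, np.nan, 54.0, 39.0],
    "MilkTypeEn": ["Cow", "Goat", np.nan, "Cow"],
    "CategoryTypeEn": ["Firm Cheese", "Soft Cheese", "Firm Cheese", np.nan],
    "Organic": [0, 1, 0, 0],
    "RindTypeEn": [np.nan, "Washed Rind", np.nan, "No Rind"],
})

# Assumed feature groupings, based on the discussion above
numeric_features = ["MoisturePercent"]
categorical_features = ["MilkTypeEn", "CategoryTypeEn"]
passthrough_features = ["Organic"]          # already binary
drop_features = ["CheeseId", "RindTypeEn"]  # identifier / many NaNs

# Mean imputation then scaling for the numeric column
numeric_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())

# Most-frequent imputation then one-hot encoding for categorical columns
categorical_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = make_column_transformer(
    (numeric_pipe, numeric_features),
    (categorical_pipe, categorical_features),
    ("passthrough", passthrough_features),
    ("drop", drop_features),
    sparse_threshold=0,  # always return a dense array
)

transformed = preprocessor.fit_transform(toy_df)
print(transformed.shape)
```

A preprocessor like this can be placed in front of a classifier with `make_pipeline`, so that imputation and encoding are fit only on the training folds during cross-validation.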

In [14]:
print(train_df.info())
train_df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 833 entries, 570 to 461
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   CheeseId              833 non-null    int64  
 1   ManufacturerProvCode  833 non-null    object 
 2   ManufacturingTypeEn   833 non-null    object 
 3   MoisturePercent       824 non-null    float64
 4   FlavourEn             636 non-null    object 
 5   CharacteristicsEn     515 non-null    object 
 6   Organic               833 non-null    int64  
 7   CategoryTypeEn        812 non-null    object 
 8   MilkTypeEn            832 non-null    object 
 9   MilkTreatmentTypeEn   780 non-null    object 
 10  RindTypeEn            580 non-null    object 
 11  CheeseName            833 non-null    object 
 12  FatLevel              833 non-null    object 
dtypes: float64(1), int64(2), object(10)
memory usage: 91.1+ KB
None


Unnamed: 0,CheeseId,ManufacturerProvCode,ManufacturingTypeEn,MoisturePercent,FlavourEn,CharacteristicsEn,Organic,CategoryTypeEn,MilkTypeEn,MilkTreatmentTypeEn,RindTypeEn,CheeseName,FatLevel
count,833.0,833,833,824.0,636,515,833.0,812,832,780,580,833,833
unique,,10,3,,519,443,,6,8,3,4,830,2
top,,QC,Industrial,,Mild,Creamy,,Firm Cheese,Cow,Pasteurized,No Rind,Ménestrel (Le),lower fat
freq,,636,371,,42,15,,278,595,642,338,2,552
mean,1569.127251,,,47.100121,,,0.092437,,,,,,
std,457.039042,,,9.658157,,,0.289816,,,,,,
min,228.0,,,17.0,,,0.0,,,,,,
25%,1282.0,,,40.0,,,0.0,,,,,,
50%,1562.0,,,46.0,,,0.0,,,,,,
75%,1919.0,,,52.0,,,0.0,,,,,,


In [16]:
#classes are only moderately imbalanced (about 66% / 34%), so no additional transformations are applied
train_df['FatLevel'].value_counts()/train_df['FatLevel'].count()

lower fat     0.662665
higher fat    0.337335
Name: FatLevel, dtype: float64
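
These proportions also indicate what a majority-class baseline would score: a dummy model that always predicts `lower fat` would be right about 66% of the time. A minimal sketch of that idea, using made-up labels in the same 66/34 ratio rather than the actual training data (the names `X_toy`, `y_toy`, and `baseline` are illustrative):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels in roughly the same ratio as the training set
y_toy = np.array(["lower fat"] * 66 + ["higher fat"] * 34)

# DummyClassifier ignores the features, so a placeholder column is enough
X_toy = np.zeros((len(y_toy), 1))

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

# Accuracy equals the share of the majority class
baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any tuned model should be compared against this kind of baseline rather than against 50%.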

In [None]:
#splitting in X/Y
