#  ML Final Project - Predicting fat level in Canadian cheese
## Jesse Sallis - BCom, MIB
### March 2022

Introduction: The purpose of this notebook is to explore a Canadian cheese dataset and to build and test multiple machine learning models that predict the fat level of various cheeses. The analysis consists of importing and cleaning the dataset, developing a baseline model, testing multiple machine learning models, and tuning hyperparameters to create the most accurate model possible.


### Question that will be explored:
* Can Canadian cheese be accurately classified as either low-fat or high-fat based on qualitative and quantitative features?

This is a classification question where each unique cheese will be classified as either lower or higher fat. The purpose of predicting fat levels is health driven; perhaps due to food allergies or diet restrictions we want to avoid cheese with high fat levels. Given this, our positive label will be lower fat. Looking over the data, I expect we will be able to develop a proficient model based on trends that are effective predictors of fat content. For example, cheese made from goat milk with a low moisture content tends to have a certain fat level. If strong relationships like this exist within the dataset, the model can pick them up and use them to drive its predictions.
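Because the positive label is lower fat, metrics such as recall and F1 need to be told which class counts as positive. The sketch below shows one way to do that with scikit-learn. The label strings `"lower fat"` and `"higher fat"` are assumptions about how *FatLevel* is encoded, and the toy predictions are invented purely for illustration.

```python
from sklearn.metrics import f1_score, make_scorer, recall_score

# Hypothetical labels and predictions (not taken from the cheese data).
y_true = ["lower fat", "higher fat", "lower fat", "lower fat"]
y_pred = ["lower fat", "lower fat", "higher fat", "lower fat"]

# Treat "lower fat" as the positive class when computing the metrics.
recall_lower = recall_score(y_true, y_pred, pos_label="lower fat")
f1_lower = f1_score(y_true, y_pred, pos_label="lower fat")

# A scorer with the same positive label, usable as `scoring=` in
# cross_validate or a hyperparameter search.
f1_lower_scorer = make_scorer(f1_score, pos_label="lower fat")
```

If the real labels differ from the strings assumed here, only the `pos_label` value needs to change.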

                
### Datasets

For the purpose of this exercise, one dataset will be used. The data can be found at https://data.amerigeoss.org/dataset/3c16cd48-3ac3-453f-8260-6f745181c83b and is released under the Open Government Licence - Canada (https://open.canada.ca/en/open-government-licence-canada).
                
* **cheese_data.csv**
    * collection of various Canadian cheeses with qualitative and quantitative descriptors (milk type, moisture content and manufacturer location)

### Method and Results

*Using the above dataset, the analysis will proceed as follows:*
* Load the dataset as a dataframe
* Create the train/test split, then run an initial diagnosis of data quality (NaNs, missing values, incorrect dtypes)
* Determine the appropriate feature columns and which scaling/imputation steps are needed
* Create a Dummy model for comparison
* Create a pipeline and use hyperparameter tuning to determine the best scoring model
* Demonstrate the prediction scores of the best scoring model
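The baseline and tuning steps above can be sketched roughly as follows. This is a minimal illustration on invented data, not the notebook's final model: the toy moisture values, the `"lower fat"`/`"higher fat"` label strings, the choice of `LogisticRegression`, and the `C` grid are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the cheese data: 60 evenly spaced moisture values,
# labelled "lower fat" above 50% moisture (an invented rule).
moisture = np.linspace(30, 70, 60)
toy_X = pd.DataFrame({"MoisturePercent": moisture})
toy_y = pd.Series(np.where(moisture > 50, "lower fat", "higher fat"))

# Baseline: always predict the most frequent class, ignoring the features.
dummy = DummyClassifier(strategy="most_frequent")
baseline = cross_validate(dummy, toy_X, toy_y, cv=5)
baseline_mean = baseline["test_score"].mean()

# Candidate model: scaling plus logistic regression, tuned over C.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(toy_X, toy_y)

print("baseline accuracy:", baseline_mean)
print("best CV accuracy:", search.best_score_)
print("best params:", search.best_params_)
```

Comparing `search.best_score_` against the dummy score shows how much the model learns beyond the class proportions alone.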

In [2]:
#importing required packages, additional packages will be imported as needed
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_validate

#importing dataset
cheese_df = pd.read_csv('data/cheese_data.csv')
cheese_df.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/cheese_data.csv'

In [None]:
#splitting train/test
train_df, test_df = train_test_split(
    cheese_df, test_size=0.2, random_state=55, shuffle=True, stratify=None
)
train_df.head()

## Exploring our dataset
 
After splitting train/test, we can now look at our dataset's characteristics. Features are a combination of quantitative (MoisturePercent) and qualitative (CheeseName) descriptors. Our target will be the column *FatLevel* with balanced classes of low/high fat.

The majority of the columns are of dtype object, so a transformation (e.g. one-hot encoding, bag of words) will need to be applied for these columns to be useful within our model. CheeseId, being an identification feature, can be dropped. Three columns have significant NaNs (FlavourEn, CharacteristicsEn, RindTypeEn), all of object dtype; the easiest solution would be to drop these columns, but that could deprive the model of potentially useful features. SimpleImputer with strategy='most_frequent' will work for the object columns, and mean imputation will work for the numeric columns. RindTypeEn, given that it has only 4 unique values and such a high number of NaNs, can be dropped. The Organic column is currently in binary form, which is helpful as it is already transformed.
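The imputation and encoding choices described above could be wired together along these lines. This is only a sketch on a tiny invented frame: the column name `MilkTypeEn` and all the values are assumptions, and the real feature lists would come from inspecting `train_df`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny invented frame with one numeric, one categorical and one id column.
toy_df = pd.DataFrame({
    "MoisturePercent": [40.0, np.nan, 52.0, 60.0],
    "MilkTypeEn": ["Cow", "Goat", np.nan, "Cow"],  # assumed column name
    "CheeseId": [1, 2, 3, 4],
})

numeric_features = ["MoisturePercent"]
categorical_features = ["MilkTypeEn"]
drop_features = ["CheeseId"]

# Numeric columns: fill NaNs with the mean, then scale.
numeric_transformer = make_pipeline(
    SimpleImputer(strategy="mean"), StandardScaler()
)
# Object columns: fill NaNs with the most frequent value, then one-hot encode.
categorical_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features),
    ("drop", drop_features),
)

transformed = preprocessor.fit_transform(toy_df)
print(transformed.shape)
```

Keeping the steps inside a column transformer means the imputation statistics are learned from the training folds only when the preprocessor is later placed in a pipeline.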

In [None]:
print(train_df.info())
train_df.describe(include='all')

In [None]:
#classes are balanced so no additional transformations needed, no need to use stratify parameter for train_test_split
train_df['FatLevel'].value_counts()/train_df['FatLevel'].count()

In [None]:
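#splitting into X/y; the target is dropped from X so it cannot leak into the features
X_train = train_df.drop(columns=['FatLevel'])
y_train = train_df['FatLevel']
X_test = test_df.drop(columns=['FatLevel'])
y_test = test_df['FatLevel']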

In [None]:
import altair as alt

In [None]:
#area chart of the MoisturePercent distribution, with a red rule marking the mean
base = alt.Chart(train_df)

bar = base.mark_area().encode(
    x=alt.X('MoisturePercent:Q', bin=alt.Bin(maxbins=100)),
    y=alt.Y('count()', stack=None)
)

rule = base.mark_rule(color='red').encode(
    x='mean(MoisturePercent):Q',
    size=alt.value(5)
)

bar + rule

In [None]:
#histogram

## References

* https://altair-viz.github.io/gallery/histogram_with_a_global_mean_overlay.html