**GOALS:**

This notebook contains a dataset from Aleksandr Antonov on Kaggle, that includes the nutritional information of
various common household products.  The hope is to make a classifier that can sort products into basic groupings; fruits,
vegetables, grains, dairy, processed foods, based on their nutritional information while also providing a simple toolset 
for individuals or restaurants determine the nutritional value of their meals or menu items.

**DATA:**

Data is downloaded from Kaggle: 
    https://www.kaggle.com/datasets/trolukovich/nutritional-values-for-common-foods-and-products


**LOADING THE DATA:**

Some immediate insights:
- There are 77 unique columns and 8789 rows of products
    
- The majority of columns are dtype "object" which isn't the greatest
- Values include units which will be best added to the headers to allow for mathematical operations
- There are plenty of columns that include vitamins or minerals the average person won't know or understand 'cryptoxanthin_beta' or 'isoleucine'
- The column for zinc is spell zink which isn't incorrect but I will switch the spelling for my own clarity; iron is also misspelled as irom.
- The only column with missing data (NaN values) appears to be saturated fats.

In [2]:
import pandas as pd
import numpy as np

nutrition = pd.read_csv('nutrition.csv')
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
pd.set_option('max_info_columns', 100) 

nutrition.head()


Unnamed: 0.1,Unnamed: 0,name,serving_size,calories,total_fat,saturated_fat,cholesterol,sodium,choline,folate,folic_acid,niacin,pantothenic_acid,riboflavin,thiamin,vitamin_a,vitamin_a_rae,carotene_alpha,carotene_beta,cryptoxanthin_beta,lutein_zeaxanthin,lucopene,vitamin_b12,vitamin_b6,vitamin_c,vitamin_d,vitamin_e,tocopherol_alpha,vitamin_k,calcium,copper,irom,magnesium,manganese,phosphorous,potassium,selenium,zink,protein,alanine,arginine,aspartic_acid,cystine,glutamic_acid,glycine,histidine,hydroxyproline,isoleucine,leucine,lysine,methionine,phenylalanine,proline,serine,threonine,tryptophan,tyrosine,valine,carbohydrate,fiber,sugars,fructose,galactose,glucose,lactose,maltose,sucrose,fat,saturated_fatty_acids,monounsaturated_fatty_acids,polyunsaturated_fatty_acids,fatty_acids_total_trans,alcohol,ash,caffeine,theobromine,water
0,0,Cornstarch,100 g,381,0.1g,,0,9.00 mg,0.4 mg,0.00 mcg,0.00 mcg,0.000 mg,0.000 mg,0.000 mg,0.000 mg,0.00 IU,0.00 mcg,0.00 mcg,0.00 mcg,0.00 mcg,0.00 mcg,0,0.00 mcg,0.000 mg,0.0 mg,0.00 IU,0.00 mg,0.00 mg,0.0 mcg,2.00 mg,0.050 mg,0.47 mg,3.00 mg,0.053 mg,13.00 mg,3.00 mg,2.8 mcg,0.06 mg,0.26 g,0.019 g,0.012 g,0.020 g,0.006 g,0.053 g,0.009 g,0.008 g,0,0.010 g,0.036 g,0.006 g,0.006 g,0.013 g,0.024 g,0.012 g,0.009 g,0.001 g,0.010 g,0.014 g,91.27 g,0.9 g,0.00 g,0,0,0,0,0,0,0.05 g,0.009 g,0.016 g,0.025 g,0.00 mg,0.0 g,0.09 g,0.00 mg,0.00 mg,8.32 g
1,1,"Nuts, pecans",100 g,691,72g,6.2g,0,0.00 mg,40.5 mg,22.00 mcg,0.00 mcg,1.167 mg,0.863 mg,0.130 mg,0.660 mg,56.00 IU,3.00 mcg,0.00 mcg,29.00 mcg,9.00 mcg,17.00 mcg,0,0.00 mcg,0.210 mg,1.1 mg,0.00 IU,1.40 mg,1.40 mg,3.5 mcg,70.00 mg,1.200 mg,2.53 mg,121.00 mg,4.500 mg,277.00 mg,410.00 mg,3.8 mcg,4.53 mg,9.17 g,0.397 g,1.177 g,0.929 g,0.152 g,1.829 g,0.453 g,0.262 g,0,0.336 g,0.598 g,0.287 g,0.183 g,0.426 g,0.363 g,0.474 g,0.306 g,0.093 g,0.215 g,0.411 g,13.86 g,9.6 g,3.97 g,0.04 g,0,0.04 g,0.00 g,0.00 g,3.90 g,71.97 g,6.180 g,40.801 g,21.614 g,0.00 mg,0.0 g,1.49 g,0.00 mg,0.00 mg,3.52 g
2,2,"Eggplant, raw",100 g,25,0.2g,,0,2.00 mg,6.9 mg,22.00 mcg,0.00 mcg,0.649 mg,0.281 mg,0.037 mg,0.039 mg,23.00 IU,1.00 mcg,0.00 mcg,14.00 mcg,0.00 mcg,36.00 mcg,0,0.00 mcg,0.084 mg,2.2 mg,0.00 IU,0.30 mg,0.30 mg,3.5 mcg,9.00 mg,0.081 mg,0.23 mg,14.00 mg,0.232 mg,24.00 mg,229.00 mg,0.3 mcg,0.16 mg,0.98 g,0.051 g,0.057 g,0.164 g,0.006 g,0.186 g,0.041 g,0.023 g,0,0.045 g,0.064 g,0.047 g,0.011 g,0.043 g,0.043 g,0.042 g,0.037 g,0.009 g,0.027 g,0.053 g,5.88 g,3.0 g,3.53 g,1.54 g,0,1.58 g,0,0,0.26 g,0.18 g,0.034 g,0.016 g,0.076 g,0.00 mg,0.0 g,0.66 g,0.00 mg,0.00 mg,92.30 g
3,3,"Teff, uncooked",100 g,367,2.4g,0.4g,0,12.00 mg,13.1 mg,0,0,3.363 mg,0.942 mg,0.270 mg,0.390 mg,9.00 IU,0.00 mcg,0.00 mcg,5.00 mcg,0.00 mcg,66.00 mcg,0,0,0.482 mg,0,0,0.08 mg,0.08 mg,1.9 mcg,180.00 mg,0.810 mg,7.63 mg,184.00 mg,9.240 mg,429.00 mg,427.00 mg,4.4 mcg,3.63 mg,13.30 g,0.747 g,0.517 g,0.820 g,0.236 g,3.349 g,0.477 g,0.301 g,0,0.501 g,1.068 g,0.376 g,0.428 g,0.698 g,0.664 g,0.622 g,0.510 g,0.139 g,0.458 g,0.686 g,73.13 g,8.0 g,1.84 g,0.47 g,0.00 g,0.73 g,0.00 g,0.01 g,0.62 g,2.38 g,0.449 g,0.589 g,1.071 g,0,0,2.37 g,0,0,8.82 g
4,4,"Sherbet, orange",100 g,144,2g,1.2g,1mg,46.00 mg,7.7 mg,4.00 mcg,0.00 mcg,0.063 mg,0.224 mg,0.097 mg,0.027 mg,46.00 IU,12.00 mcg,0.00 mcg,1.00 mcg,5.00 mcg,7.00 mcg,0,0.13 mcg,0.023 mg,2.3 mg,0.00 IU,0.01 mg,0.01 mg,0.0 mcg,54.00 mg,0.028 mg,0.14 mg,8.00 mg,0.011 mg,40.00 mg,96.00 mg,1.5 mcg,0.48 mg,1.10 g,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,30.40 g,1.3 g,24.32 g,0,0,0,0,0,0,2.00 g,1.160 g,0.530 g,0.080 g,1.00 mg,0.0 g,0.40 g,0.00 mg,0.00 mg,66.10 g


In [3]:
#nutrition.columns
print(nutrition.dtypes.to_string())

Unnamed: 0                      int64
name                           object
serving_size                   object
calories                        int64
total_fat                      object
saturated_fat                  object
cholesterol                    object
sodium                         object
choline                        object
folate                         object
folic_acid                     object
niacin                         object
pantothenic_acid               object
riboflavin                     object
thiamin                        object
vitamin_a                      object
vitamin_a_rae                  object
carotene_alpha                 object
carotene_beta                  object
cryptoxanthin_beta             object
lutein_zeaxanthin              object
lucopene                        int64
vitamin_b12                    object
vitamin_b6                     object
vitamin_c                      object
vitamin_d                      object
vitamin_e   

In [4]:
nutrition.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8789 entries, 0 to 8788
Data columns (total 77 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Unnamed: 0                   8789 non-null   int64 
 1   name                         8789 non-null   object
 2   serving_size                 8789 non-null   object
 3   calories                     8789 non-null   int64 
 4   total_fat                    8789 non-null   object
 5   saturated_fat                7199 non-null   object
 6   cholesterol                  8789 non-null   object
 7   sodium                       8789 non-null   object
 8   choline                      8789 non-null   object
 9   folate                       8789 non-null   object
 10  folic_acid                   8789 non-null   object
 11  niacin                       8789 non-null   object
 12  pantothenic_acid             8789 non-null   object
 13  riboflavin                   8789

**SATURATED FAT NAN VALUES:**

saturated_fat is the only column with NaN values, and at first glance they appear to be products with very low total
fat count.  While it maybe easy to replace the saturated NaN value with a 0, there are outliers like Pan Dulce at index 8772 so further exploration is needed.


In [5]:
# Printing out saturated_fat NaN values

nan_SF = nutrition[nutrition['saturated_fat'].isna()]
#print(nan_SF)

# Commenting out the print statement because it will be nearly illegible outside of the workspace.

**DATA CLEANING:**

As mentioned above there are 77 unique features of this dataset and for the majority of people 85% of them will be 
unneccesary.  Copying and limiting the dataset to 12 features will ensure the integrity of the original set while providing the most useful results for our purposes.

Included columns are:

    'name', 'serving_size', 'calories', 'total_fat', 'saturated_fat', 'cholesterol', 'sodium', 'sugars', 'carbohydrate', 'fiber', 'alcohol', 'caffeine'

In [7]:
nutrition_filtered = nutrition[['name', 'serving_size', 'calories', 'total_fat', 'saturated_fat', 'cholesterol', 'sodium', 
                               'sugars', 'carbohydrate', 'fiber', 'alcohol', 'caffeine']]
print(nutrition_filtered.head())

              name serving_size  calories total_fat saturated_fat cholesterol    sodium   sugars carbohydrate  fiber alcohol caffeine
0       Cornstarch        100 g       381      0.1g           NaN           0   9.00 mg   0.00 g      91.27 g  0.9 g   0.0 g  0.00 mg
1     Nuts, pecans        100 g       691       72g          6.2g           0   0.00 mg   3.97 g      13.86 g  9.6 g   0.0 g  0.00 mg
2    Eggplant, raw        100 g        25      0.2g           NaN           0   2.00 mg   3.53 g       5.88 g  3.0 g   0.0 g  0.00 mg
3   Teff, uncooked        100 g       367      2.4g          0.4g           0  12.00 mg   1.84 g      73.13 g  8.0 g       0        0
4  Sherbet, orange        100 g       144        2g          1.2g         1mg  46.00 mg  24.32 g      30.40 g  1.3 g   0.0 g  0.00 mg


**CLEANING THE UNITS:**

Each feature after name has additional units applied to each value.  The units are useful but make using the data in numerical operations impossible.  Moving the units to the column name will keep the information but make the data more useable.

In [14]:
# First we need to go column by column and ensure the units are uniform for each column.  If the serving size for
# 'apple' is in 'ea' for each that'll throw off any stripping we do.

# all_grams_n where n is the index of the column name

# serving_size
all_grams_2 = nutrition['serving_size'].str.match(r'^\d+\s*g$').all()
#print(all_grams_2) #True

# total_fat
all_grams_4 = nutrition['serving_size'].str.match(r'^\d+\s*g$').all()
#print(all_grams_4) #True

# saturated_fat
#all_grams_5 = nutrition['serving_size'].str.match(r'^\d+\s*g$').all()
#print(all_grams_5)

# sodium
all_grams_7 = nutrition['serving_size'].str.match(r'^\d+\s*mg$').all()
#print(all_grams_7) #False

# sugars
all_grams_8 = nutrition['serving_size'].str.match(r'^\d+\s*g$').all()
#print(all_grams_8) #True

# carbohydrate
all_grams_9 = nutrition['serving_size'].str.match(r'^\d+\s*g$').all()
#print(all_grams_9) #True

# fiber
all_grams_10 = nutrition['serving_size'].str.match(r'^\d+\s*g$').all()
print(all_grams_10) #True


True


**SODIUM:**

The unit check for sodium came back False meaning that some values within that column have units that aren't 'mg' and finding if those units are omitted completely like they are in cholesterol, alcohol, and caffeine or are a different measurement is important.  Luckily the units are metric so numeric conversion is simplified.

- The units for sodium were found to be mg, and nan which means luckily no numeric conversions are necessary and sodium can be folded into the next function which will deal with that problem in other columns.

In [29]:
# Extract the non-numeric part of each value in the 'sodium' column
units = nutrition['sodium'].str.extract(r'(\D+)$')

# Find the unique units
unique_units = units[0].unique()

print(unique_units)

# Checking the non-numeric part of cholesterol values because my function is 
# producing the expected results.  Checking my expectations really.

units_chol = nutrition['cholesterol'].str.extract(r'(\D+)$')
unique_chol_units = units[0].unique()
print(unique_chol_units)

[' mg' nan]
[' mg' nan]


**SODIUM, CHOLESTEROL, ALCOHOL, AND CAFFEINE:**

These four columns have a mix of values with units and without units.  To make sure all units match we need to create a separate function that can be repeatedly called to make sure those string values found within these columns have the same units so we can strip them of those unit abbreviations.

In [34]:
def check_nutrition_units(value):
    if isinstance(value, str):
        return value.endswith('g') or value.endswith('mg') or value == '0'
    return value == 0  # Explicitly check for numeric 0 values

# Apply the function to the 'cholesterol' column
cholesterol_check = nutrition['cholesterol'].apply(check_nutrition_units)
sodium_check = nutrition['sodium'].apply(check_nutrition_units)
alcohol_check = nutrition['alcohol'].apply(check_nutrition_units)
caffeine_check = nutrition['caffeine'].apply(check_nutrition_units)

# Troubleshooting and testing my function to make sure the outputs are as expected
        # failing_rows = nutrition[~sodium_check]
        # print(failing_rows[['sodium']])
        # print(failing_rows['sodium'].apply(type).unique())

# Alcohol is in units 'g' so the first run produced false.  Adding a caveat for 'g' 
# and not just 'mg' in the function solved the problem.
        # failing_rows = nutrition[~alcohol_check]
        # print(failing_rows[['alcohol']])
        # print(failing_rows['alcohol'].apply(type).unique())



# Print the result
print(cholesterol_check.all())
print(sodium_check.all())
print(alcohol_check.all())
print(caffeine_check.all())

True
True
True
True


**STRIPPING UNITS:**

Now that we've surmized that the units are indeed uniform for our columns so long as the row value is non-zero we can carefully strip the string values off each column and check/change the datatype to numeric for future calculations.