## Data Understanding

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Import the required Python libraries. Cleaning and preprocessing the data requires only NumPy and pandas.

In [None]:
import pandas as pd
import numpy as np

Read the food nutrition data using the read_csv function from pandas.

In [None]:
df = pd.read_csv('drive/MyDrive/data_nutrisi.csv')

The pandas `info()` method gives a brief summary of the food data: the number of rows and columns, the data type of each column, and the number of non-null values in each column.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1148 entries, 0 to 1147
Data columns (total 28 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             1148 non-null   int64 
 1   kode           1148 non-null   object
 2   makanan        1148 non-null   object
 3   air            1148 non-null   object
 4   kalori         1148 non-null   int64 
 5   protein        1148 non-null   object
 6   lemak          1148 non-null   object
 7   karbohidrat    1148 non-null   object
 8   serat          1148 non-null   object
 9   abu            1148 non-null   object
 10  kalsium        1148 non-null   object
 11  fosfor         1148 non-null   object
 12  besi           1148 non-null   object
 13  natrium        1148 non-null   object
 14  kalium         1148 non-null   object
 15  tembaga        1148 non-null   object
 16  seng           1148 non-null   object
 17  vitamin A      1148 non-null   object
 18  beta karoten   1148 non-null

The summary describes the structure of the data but not what the records look like. To see example rows, we use the `head()` method, which returns the first 5 rows of the dataset.

In [None]:
df.head()

Unnamed: 0,id,kode,makanan,air,kalori,protein,lemak,karbohidrat,serat,abu,...,beta karoten,karoten total,vitamin B1,vitamin B2,Niasin,Vitamin C,BDD,Mentah/Olahan,Kategori,Sumber
0,1,AP001,Nasi,567,180,30,3,398,02,2,...,-,-,005,010,26,-,100,Olahan,Serealia,KZGPI-1990
1,2,AP002,Nasi tim,710,120,24,4,260,05,2,...,-,-,010,-,14,-,100,Olahan,Serealia,OKN-1992
2,3,AP003,Tapai beras,755,99,17,3,224,-,1,...,-,-,-,-,-,-,100,Olahan,Serealia,KZGMI-2001
3,4,AP004,"Tepung beras, mentah",120,353,70,5,800,24,5,...,-,-,012,010,12,-,100,Olahan,Serealia,DABM-1964
4,5,AP005,Nasi beras merah,640,149,28,4,325,03,3,...,-,-,006,-,16,-,100,Olahan,Serealia,KZGPI-1990


The nutrient columns have a problem: they should be numeric (float or integer), but they are stored as `object`, that is, as strings. In addition, some nutrient values are recorded as '-' instead of 0. The data must be cleaned before it can be used.
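Before cleaning, it can help to measure how widespread the problem is. The following is a minimal sketch on a small hypothetical frame (`raw`, `nutrient_cols`, and the sample values are assumptions for illustration, not the real dataset) that counts the '-' placeholders per column and lists the columns that are not yet numeric.

```python
import pandas as pd

# Hypothetical miniature of the raw data: numbers stored as strings,
# with '-' standing in for unrecorded values.
raw = pd.DataFrame({
    "makanan": ["Nasi", "Tapai beras", "Nasi tim"],
    "serat": ["0,2", "-", "0,5"],
    "vitamin C": ["-", "-", "-"],
})

nutrient_cols = ["serat", "vitamin C"]

# Number of '-' placeholders in each nutrient column.
dash_counts = (raw[nutrient_cols] == "-").sum()
print(dash_counts)

# Nutrient columns whose dtype is not numeric yet.
non_numeric = [c for c in nutrient_cols
               if not pd.api.types.is_numeric_dtype(raw[c])]
print(non_numeric)
```

On the real data the same two checks would be run against `df` and the full list of nutrient columns.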

## Data Cleaning

We create a `convert_to_numeric` function that turns a string value into a number: it returns 0 for '-' and, for any other value, replaces the decimal comma with a decimal point before parsing. We then use the `applymap()` method to apply the function to every value in the object-typed nutrient columns.

In [None]:
# Nutrient columns that are stored as strings and need numeric conversion
column = ['air', 'protein', 'lemak',
       'karbohidrat', 'serat', 'abu', 'kalsium', 'fosfor', 'besi', 'natrium',
       'kalium', 'tembaga', 'seng', 'vitamin A', 'beta karoten',
       'karoten total', 'vitamin B1', 'vitamin B2', 'Niasin', 'Vitamin C']

def convert_to_numeric(value):
    # '-' marks an unrecorded value; treat it as 0
    if value == '-':
        return 0
    # Swap the decimal comma for a point; unparseable values become NaN
    return pd.to_numeric(value.replace(',', '.'), errors='coerce')

df[column] = df[column].applymap(convert_to_numeric)
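`applymap()` calls the function once per cell, and pandas 2.1 and later deprecate it in favor of `DataFrame.map`. As an alternative, here is a minimal column-wise sketch of the same cleaning rule. The helper name `to_numeric_column` and the sample frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mimicking the raw string columns (not the real dataset).
sample = pd.DataFrame({
    "air": ["56,7", "71,0", "-"],
    "protein": ["3,0", "-", "abc"],
})

def to_numeric_column(col: pd.Series) -> pd.Series:
    """Convert one string column: '-' becomes 0, decimal commas become points."""
    cleaned = col.replace("-", "0").str.replace(",", ".", regex=False)
    # Anything that still cannot be parsed is coerced to NaN.
    return pd.to_numeric(cleaned, errors="coerce")

# apply() passes each column to the helper as a whole Series.
converted = sample.apply(to_numeric_column)
print(converted)
```

Working on whole columns keeps the rule in one place and avoids per-cell Python calls, which matters little at 1148 rows but scales better.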

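To check that the conversion worked, we call the `info()` method again. The nutrient columns are now `float64`. Note that `kalium` and `tembaga` report fewer than 1148 non-null values: a few entries could not be parsed and were coerced to NaN.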

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1148 entries, 0 to 1147
Data columns (total 28 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             1148 non-null   int64  
 1   kode           1148 non-null   object 
 2   makanan        1148 non-null   object 
 3   air            1148 non-null   float64
 4   kalori         1148 non-null   int64  
 5   protein        1148 non-null   float64
 6   lemak          1148 non-null   float64
 7   karbohidrat    1148 non-null   float64
 8   serat          1148 non-null   float64
 9   abu            1148 non-null   float64
 10  kalsium        1148 non-null   float64
 11  fosfor         1148 non-null   float64
 12  besi           1148 non-null   float64
 13  natrium        1148 non-null   float64
 14  kalium         1105 non-null   float64
 15  tembaga        1146 non-null   float64
 16  seng           1148 non-null   float64
 17  vitamin A      1148 non-null   float64
 18  beta kar

In [None]:
df.head()

Unnamed: 0,id,kode,makanan,air,kalori,protein,lemak,karbohidrat,serat,abu,...,beta karoten,karoten total,vitamin B1,vitamin B2,Niasin,Vitamin C,BDD,Mentah/Olahan,Kategori,Sumber
0,1,AP001,Nasi,56.7,180,3.0,0.3,39.8,0.2,0.2,...,0.0,0.0,0.05,0.1,2.6,0.0,100,Olahan,Serealia,KZGPI-1990
1,2,AP002,Nasi tim,71.0,120,2.4,0.4,26.0,0.5,0.2,...,0.0,0.0,0.1,0.0,1.4,0.0,100,Olahan,Serealia,OKN-1992
2,3,AP003,Tapai beras,75.5,99,1.7,0.3,22.4,0.0,0.1,...,0.0,0.0,0.0,0.0,0.0,0.0,100,Olahan,Serealia,KZGMI-2001
3,4,AP004,"Tepung beras, mentah",12.0,353,7.0,0.5,80.0,2.4,0.5,...,0.0,0.0,0.12,0.1,1.2,0.0,100,Olahan,Serealia,DABM-1964
4,5,AP005,Nasi beras merah,64.0,149,2.8,0.4,32.5,0.3,0.3,...,0.0,0.0,0.06,0.0,1.6,0.0,100,Olahan,Serealia,KZGPI-1990
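The second `info()` output above reports fewer than 1148 non-null values for `kalium` and `tembaga`, so some values became NaN during conversion. A minimal sketch for locating such rows follows; the frame `converted` and its values are hypothetical stand-ins for the cleaned `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the converted frame (values are illustrative only).
converted = pd.DataFrame({
    "makanan": ["A", "B", "C"],
    "kalium": [120.0, np.nan, 85.0],
    "tembaga": [0.1, 0.2, np.nan],
})

numeric_cols = ["kalium", "tembaga"]

# How many values are missing in each column.
missing_per_column = converted[numeric_cols].isna().sum()
print(missing_per_column)

# The rows that contain at least one missing value, for manual inspection.
rows_with_missing = converted[converted[numeric_cols].isna().any(axis=1)]
print(rows_with_missing)
```

In this notebook the affected columns are dropped in the next section, so the check mainly confirms that no macronutrient column is affected.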


## Data Preprocessing

For this project we use only macronutrients, so all micronutrient columns and any other columns we do not need are dropped.

In [None]:
# Columns that are not needed for the project
column_tidak_berguna = ['id', 'kode', 'abu', 'kalsium', 'fosfor', 'besi', 'natrium',
       'kalium', 'tembaga', 'seng', 'vitamin A', 'beta karoten',
       'karoten total', 'vitamin B1', 'vitamin B2', 'Niasin', 'Vitamin C',
       'BDD', 'Sumber']

df = df.drop(columns=column_tidak_berguna)

For this project we use only processed ('Olahan') foods, so we create a new dataframe that contains only those rows.

In [None]:
df['Mentah/Olahan'].value_counts()

Olahan     588
Tunggal    560
Name: Mentah/Olahan, dtype: int64

In [None]:
olahan = df[df['Mentah/Olahan'] == 'Olahan'].copy()
olahan = olahan.reset_index(drop=True)

In [None]:
olahan

Unnamed: 0,makanan,air,kalori,protein,lemak,karbohidrat,serat,Mentah/Olahan,Kategori
0,Nasi,56.7,180,3.0,0.3,39.8,0.2,Olahan,Serealia
1,Nasi tim,71.0,120,2.4,0.4,26.0,0.5,Olahan,Serealia
2,Tapai beras,75.5,99,1.7,0.3,22.4,0.0,Olahan,Serealia
3,"Tepung beras, mentah",12.0,353,7.0,0.5,80.0,2.4,Olahan,Serealia
4,Nasi beras merah,64.0,149,2.8,0.4,32.5,0.3,Olahan,Serealia
...,...,...,...,...,...,...,...,...,...
583,Saos tomat,69.5,110,2.0,0.4,24.5,0.9,Olahan,Bumbu
584,Tempoya,70.1,110,1.7,1.3,21.9,2.6,Olahan,Bumbu
585,Terasi,33.8,155,22.3,2.9,9.9,2.7,Olahan,Bumbu
586,Terasi dobo,58.4,191,33.1,3.6,6.6,1.6,Olahan,Bumbu


We also remove all rows whose food name contains the string "mentah" (raw), so that the data covers only processed foods.

In [None]:
olahan = olahan[~olahan['makanan'].str.contains('mentah', case=False)].reset_index(drop=True)
olahan

Unnamed: 0,makanan,air,kalori,protein,lemak,karbohidrat,serat,Mentah/Olahan,Kategori
0,Nasi,56.7,180,3.0,0.3,39.8,0.2,Olahan,Serealia
1,Nasi tim,71.0,120,2.4,0.4,26.0,0.5,Olahan,Serealia
2,Tapai beras,75.5,99,1.7,0.3,22.4,0.0,Olahan,Serealia
3,Nasi beras merah,64.0,149,2.8,0.4,32.5,0.3,Olahan,Serealia
4,Bihun goreng instan,9.0,381,6.1,3.9,80.3,0.0,Olahan,Serealia
...,...,...,...,...,...,...,...,...,...
535,Saos tomat,69.5,110,2.0,0.4,24.5,0.9,Olahan,Bumbu
536,Tempoya,70.1,110,1.7,1.3,21.9,2.6,Olahan,Bumbu
537,Terasi,33.8,155,22.3,2.9,9.9,2.7,Olahan,Bumbu
538,Terasi dobo,58.4,191,33.1,3.6,6.6,1.6,Olahan,Bumbu
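The name filter above can be illustrated on a few rows. This is a minimal sketch with a hypothetical frame `foods`; it also passes `na=False`, an addition not present in the original cell, so that a missing name counts as a non-match instead of producing NaN in the mask.

```python
import pandas as pd

# Hypothetical sample; the last name is deliberately missing.
foods = pd.DataFrame({
    "makanan": ["Nasi", "Tepung beras, mentah", "Ubi MENTAH", None],
})

# case=False matches 'mentah' in any letter case;
# na=False treats missing names as non-matches.
is_raw = foods["makanan"].str.contains("mentah", case=False, na=False)

# Keep only the rows that do not mention 'mentah' and renumber the index.
processed_only = foods[~is_raw].reset_index(drop=True)
print(processed_only)
```

In the dataset used here `makanan` has no missing values, so `na=False` is only a safeguard.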


We create a new feature, 'Vegan_NonVegan', based on the food category: if the category is 'Daging', 'Telur', 'Lemak', 'Susu', or 'Ikan dsb', the value is 'non_vegan'; otherwise the value is 'vegan'.

In [None]:
# Categories treated as non-vegan
non_vegan = ['Daging', 'Telur', 'Lemak', 'Susu', 'Ikan dsb']

# Label every row from its category in one vectorized step
olahan['Vegan_NonVegan'] = np.where(olahan['Kategori'].isin(non_vegan), 'non_vegan', 'vegan')

In [None]:
olahan

Unnamed: 0,makanan,air,kalori,protein,lemak,karbohidrat,serat,Mentah/Olahan,Kategori,Vegan_NonVegan
0,Nasi,56.7,180,3.0,0.3,39.8,0.2,Olahan,Serealia,vegan
1,Nasi tim,71.0,120,2.4,0.4,26.0,0.5,Olahan,Serealia,vegan
2,Tapai beras,75.5,99,1.7,0.3,22.4,0.0,Olahan,Serealia,vegan
3,Nasi beras merah,64.0,149,2.8,0.4,32.5,0.3,Olahan,Serealia,vegan
4,Bihun goreng instan,9.0,381,6.1,3.9,80.3,0.0,Olahan,Serealia,vegan
...,...,...,...,...,...,...,...,...,...,...
535,Saos tomat,69.5,110,2.0,0.4,24.5,0.9,Olahan,Bumbu,vegan
536,Tempoya,70.1,110,1.7,1.3,21.9,2.6,Olahan,Bumbu,vegan
537,Terasi,33.8,155,22.3,2.9,9.9,2.7,Olahan,Bumbu,vegan
538,Terasi dobo,58.4,191,33.1,3.6,6.6,1.6,Olahan,Bumbu,vegan
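Because the label is assigned per category, not per item, a cross-tabulation is a quick way to verify the result and to spot categories that may need item-level review. The sketch below uses a hypothetical frame `labelled` and assumes the column name `Vegan_NonVegan` shown in the output above.

```python
import pandas as pd

# Hypothetical labelled sample (not the real dataset).
labelled = pd.DataFrame({
    "Kategori": ["Serealia", "Daging", "Daging", "Bumbu", "Susu"],
    "Vegan_NonVegan": ["vegan", "non_vegan", "non_vegan", "vegan", "non_vegan"],
})

# Rows: food category; columns: label; cells: number of foods.
summary = pd.crosstab(labelled["Kategori"], labelled["Vegan_NonVegan"])
print(summary)
```

Each category should have a non-zero count in exactly one of the two columns; anything else points to a labeling error.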


The 'Mentah/Olahan' column is dropped because it is no longer needed in the project.

In [None]:
olahan = olahan.drop(columns='Mentah/Olahan')  # drop() returns a new frame; assign it back
olahan

Unnamed: 0,makanan,air,kalori,protein,lemak,karbohidrat,serat,Kategori,Vegan_NonVegan
0,Nasi,56.7,180,3.0,0.3,39.8,0.2,Serealia,vegan
1,Nasi tim,71.0,120,2.4,0.4,26.0,0.5,Serealia,vegan
2,Tapai beras,75.5,99,1.7,0.3,22.4,0.0,Serealia,vegan
3,Nasi beras merah,64.0,149,2.8,0.4,32.5,0.3,Serealia,vegan
4,Bihun goreng instan,9.0,381,6.1,3.9,80.3,0.0,Serealia,vegan
...,...,...,...,...,...,...,...,...,...
535,Saos tomat,69.5,110,2.0,0.4,24.5,0.9,Bumbu,vegan
536,Tempoya,70.1,110,1.7,1.3,21.9,2.6,Bumbu,vegan
537,Terasi,33.8,155,22.3,2.9,9.9,2.7,Bumbu,vegan
538,Terasi dobo,58.4,191,33.1,3.6,6.6,1.6,Bumbu,vegan


We export the data to a CSV file.

In [None]:
olahan.to_csv('olahan.csv')
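By default `to_csv` also writes the row index, which shows up as an `Unnamed: 0` column when the file is read back. The sketch below demonstrates the difference on a hypothetical two-row frame, using in-memory buffers in place of a file; whether to pass `index=False` in the notebook is a design choice, not something the original specifies.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for `olahan`.
frame = pd.DataFrame({"makanan": ["Nasi", "Nasi tim"], "kalori": [180, 120]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```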

## Further Analysis
For a more detailed analysis, we examine the data in a spreadsheet. These are the steps taken in the spreadsheet:

1. Data Reduction: We minimize the amount of data by selecting only the most popular food items, reducing the number of records from 540 to 77.

2. Introducing a New Feature: We incorporate a new feature called 'food_tag'. This feature encompasses information about how the food is cooked, its category, and its taste.
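The two steps above were done by hand in a spreadsheet. For readers who prefer to stay in pandas, they could be sketched as follows. The list `popular_items`, the mapping `food_tags`, and the tag values are hypothetical placeholders, not the selections or tags actually used in the project.

```python
import pandas as pd

# Hypothetical slice of the exported data.
olahan_sample = pd.DataFrame({
    "makanan": ["Nasi", "Nasi tim", "Tapai beras", "Saos tomat"],
    "Kategori": ["Serealia", "Serealia", "Serealia", "Bumbu"],
})

# Step 1 (data reduction): keep only a hand-picked list of popular items.
popular_items = ["Nasi", "Saos tomat"]  # hypothetical selection
reduced = olahan_sample[olahan_sample["makanan"].isin(popular_items)]
reduced = reduced.reset_index(drop=True)

# Step 2 (new feature): attach a hand-written tag describing cooking method,
# category, and taste. The tag texts here are made up for illustration.
food_tags = {
    "Nasi": "steamed, staple, plain",
    "Saos tomat": "sauce, condiment, sour",
}
reduced["food_tag"] = reduced["makanan"].map(food_tags)
print(reduced)
```

Items without an entry in the mapping would get NaN in `food_tag`, which makes untagged foods easy to find.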