## Import Library

In [1]:
import pandas as pd

## Gathering Data
The dataset was collected using a web scraper from the Glycemic Index Guide [website](https://glycemic-index.net/).

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Glucofy-Team/Glucofy-Machine-Learning/main/data/nutrition%20food%20dataset%20-%20translated.csv')

## Assessing Data
Check the dataset for missing values, duplicate entries, and implausible values.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 586 entries, 0 to 585
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   web-scraper-order      586 non-null    object 
 1   web-scraper-start-url  586 non-null    object 
 2   category-href          586 non-null    object 
 3   category               586 non-null    object 
 4   name                   586 non-null    object 
 5   glycemic_index         586 non-null    int64  
 6   glycemic_load          586 non-null    float64
 7   calories (kcal)        586 non-null    int64  
 8   proteins (g)           586 non-null    float64
 9   carbohydrates (g)      586 non-null    float64
 10  fats (g)               586 non-null    float64
dtypes: float64(4), int64(2), object(5)
memory usage: 50.5+ KB


In [4]:
df.isna().sum()

web-scraper-order        0
web-scraper-start-url    0
category-href            0
category                 0
name                     0
glycemic_index           0
glycemic_load            0
calories (kcal)          0
proteins (g)             0
carbohydrates (g)        0
fats (g)                 0
dtype: int64

In [5]:
df['name'].duplicated().sum()

17
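The count alone does not show whether the repeated names carry identical or conflicting values. A minimal sketch of how the duplicate groups could be inspected side by side is shown here. The `sample` frame and its values are hypothetical stand-ins for `df`, not rows from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the values are made up.
sample = pd.DataFrame({
    "name": ["Apple", "Banana", "Apple", "Rice"],
    "glycemic_index": [35, 60, 38, 70],
})

# keep=False marks every member of a duplicate group, not only the repeats,
# so conflicting values for the same name can be compared directly.
dupes = sample[sample["name"].duplicated(keep=False)].sort_values("name")
print(dupes)
```

If the duplicated names turn out to carry different values, keeping only the first occurrence is a choice that may deserve a second look.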

In [6]:
df.describe()

| | glycemic_index | glycemic_load | calories (kcal) | proteins (g) | carbohydrates (g) | fats (g) |
|---|---|---|---|---|---|---|
| count | 586.0 | 586.0 | 586.0 | 586.0 | 586.0 | 586.0 |
| mean | 41.298635 | 14.779181 | 217.59727 | 6.961792 | 28.504522 | 8.719266 |
| std | 24.131008 | 18.316097 | 180.102454 | 7.618162 | 27.074133 | 16.960519 |
| min | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 25% | 25.0 | 1.6 | 56.0 | 1.125 | 6.025 | 0.2 |
| 50% | 40.0 | 7.0 | 187.0 | 3.65 | 17.1 | 1.45 |
| 75% | 60.0 | 22.175 | 339.25 | 10.075 | 52.775 | 8.0 |
| max | 115.0 | 95.0 | 900.0 | 46.0 | 100.0 | 100.0 |
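Summary statistics only hint at implausible values. An explicit range check can list the offending rows. The sketch below assumes the gram columns are reported per 100 g of food, which the source does not state (the maxima of 100.0 are consistent with that but do not confirm it). The `sample` frame and the 0 to 100 bounds are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the values are made up.
sample = pd.DataFrame({
    "name": ["A", "B", "C"],
    "proteins (g)": [10.0, 0.0, 120.0],
    "carbohydrates (g)": [50.0, -1.0, 20.0],
    "fats (g)": [5.0, 0.0, 3.0],
})

gram_cols = ["proteins (g)", "carbohydrates (g)", "fats (g)"]

# Assumption: values are per 100 g, so each nutrient should lie in [0, 100].
in_range = sample[gram_cols].apply(lambda s: s.between(0, 100))
suspicious = sample[~in_range.all(axis=1)]
print(suspicious)
```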


## Cleaning Data

- Drop the web-scraper metadata columns, which carry no nutritional information.

In [7]:
col = ['web-scraper-order', 'web-scraper-start-url', 'category-href']

df.drop(columns=col, inplace=True)

- Drop rows with a duplicate `name`, keeping the first occurrence.

In [8]:
df.drop_duplicates(subset=['name'], inplace=True)

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 569 entries, 0 to 585
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   category           569 non-null    object 
 1   name               569 non-null    object 
 2   glycemic_index     569 non-null    int64  
 3   glycemic_load      569 non-null    float64
 4   calories (kcal)    569 non-null    int64  
 5   proteins (g)       569 non-null    float64
 6   carbohydrates (g)  569 non-null    float64
 7   fats (g)           569 non-null    float64
dtypes: float64(4), int64(2), object(2)
memory usage: 40.0+ KB
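The index above still runs from 0 to 585 with gaps, because `drop_duplicates` keeps the original row labels. If a contiguous index is wanted, one option is to reset it after dropping. This is a sketch on a hypothetical `sample` frame, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the values are made up.
sample = pd.DataFrame({
    "name": ["Apple", "Banana", "Apple", "Rice"],
    "glycemic_index": [35, 60, 38, 70],
})

# Keep the first row per name, then renumber the rows 0..n-1.
# drop=True discards the old labels instead of adding them as a column.
deduped = sample.drop_duplicates(subset=["name"]).reset_index(drop=True)
print(deduped)
```

Since the notebook later saves with `index=False`, the gaps do not reach the CSV either way. Resetting only matters for positional lookups within the notebook.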


- Convert the two integer columns (`glycemic_index` and `calories (kcal)`) to `float64` so that all numeric columns share one data type.

In [10]:
col = ['glycemic_index', 'calories (kcal)']

df[col] = df[col].astype('float64')

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 569 entries, 0 to 585
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   category           569 non-null    object 
 1   name               569 non-null    object 
 2   glycemic_index     569 non-null    float64
 3   glycemic_load      569 non-null    float64
 4   calories (kcal)    569 non-null    float64
 5   proteins (g)       569 non-null    float64
 6   carbohydrates (g)  569 non-null    float64
 7   fats (g)           569 non-null    float64
dtypes: float64(6), object(2)
memory usage: 40.0+ KB
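Instead of listing the integer columns by hand, they could be selected by dtype, which keeps the step correct if the column set changes. The following is a sketch of that alternative on a hypothetical `sample` frame, not the method used in the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; the values are made up.
sample = pd.DataFrame({
    "name": ["Apple", "Rice"],
    "glycemic_index": pd.Series([35, 70], dtype="int64"),
    "glycemic_load": [5.0, 30.1],
})

# Find every int64 column and cast only those to float64.
int_cols = sample.select_dtypes(include="int64").columns
sample = sample.astype({c: "float64" for c in int_cols})
print(sample.dtypes)
```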


## Save Dataset

In [13]:
df.to_csv('nutrition food dataset - modified.csv', index=False)
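A quick round-trip read can confirm that the saved file has the expected shape and columns. The sketch below writes a hypothetical `sample` frame to a temporary directory so that it runs without touching the real output file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned `df`; the values are made up.
sample = pd.DataFrame({
    "name": ["Apple", "Rice"],
    "glycemic_index": [35.0, 70.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "nutrition food dataset - modified.csv")
    sample.to_csv(path, index=False)  # same call pattern as the notebook
    reloaded = pd.read_csv(path)

# Compare structure between what was written and what was read back.
same_shape = reloaded.shape == sample.shape
same_columns = list(reloaded.columns) == list(sample.columns)
print(same_shape, same_columns)
```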