## Import Library

In [10]:
import pandas as pd

## Gathering Data
The dataset was scraped from the [Glycemic Index Guide](https://glycemic-index.net/) website. The translated copy is loaded from the project's GitHub repository.

In [11]:
df = pd.read_csv('https://raw.githubusercontent.com/Glucofy-Team/Glucofy-Machine-Learning/main/data/nutrition%20food%20dataset%20-%20translated.csv')

## Assessing Data
Check for missing values, duplicate data, and inaccurate values.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 577 entries, 0 to 576
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   web-scraper-order      577 non-null    object 
 1   web-scraper-start-url  577 non-null    object 
 2   category-href          577 non-null    object 
 3   category               577 non-null    object 
 4   name                   577 non-null    object 
 5   glycemic_index         577 non-null    int64  
 6   glycemic_load          577 non-null    float64
 7   calories (kcal)        577 non-null    int64  
 8   proteins (g)           577 non-null    float64
 9   carbohydrates (g)      577 non-null    float64
 10  fats (g)               577 non-null    float64
dtypes: float64(4), int64(2), object(5)
memory usage: 49.7+ KB


In [13]:
df.isna().sum()

web-scraper-order        0
web-scraper-start-url    0
category-href            0
category                 0
name                     0
glycemic_index           0
glycemic_load            0
calories (kcal)          0
proteins (g)             0
carbohydrates (g)        0
fats (g)                 0
dtype: int64

In [14]:
df['name'].duplicated().sum()

4
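The count alone does not show which rows collide. Below is a minimal sketch of listing every row that shares a name, run on a small hypothetical frame (`toy` and its values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample standing in for the real dataset.
toy = pd.DataFrame({
    'name': ['Apple', 'Banana', 'Apple', 'Carrot'],
    'glycemic_index': [35, 60, 36, 30],
})

# keep=False marks every member of a duplicate group, not only the repeats,
# so both versions of a repeated name can be compared side by side.
dupes = toy[toy['name'].duplicated(keep=False)].sort_values('name')
print(dupes)
```

Comparing the colliding rows this way helps decide whether keeping the first occurrence is acceptable or whether the rows describe different foods.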

In [15]:
df.describe()

| | glycemic_index | glycemic_load | calories (kcal) | proteins (g) | carbohydrates (g) | fats (g) |
|---|---|---|---|---|---|---|
| count | 577.0 | 577.0 | 577.0 | 577.0 | 577.0 | 577.0 |
| mean | 41.521664 | 14.891854 | 216.918544 | 6.85253 | 28.513605 | 8.669133 |
| std | 24.136331 | 18.397778 | 179.784126 | 7.492504 | 27.047262 | 16.971254 |
| min | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 |
| 25% | 25.0 | 1.6 | 56.0 | 1.1 | 6.1 | 0.2 |
| 50% | 40.0 | 7.0 | 185.0 | 3.5 | 17.2 | 1.4 |
| 75% | 60.0 | 22.2 | 337.0 | 10.0 | 52.7 | 8.0 |
| max | 115.0 | 95.0 | 900.0 | 46.0 | 100.0 | 100.0 |
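Summary statistics only hint at inaccurate values, so an explicit rule can help. The sketch below flags rows with negative macronutrients or with macronutrients summing to more than 100 g. It assumes the values are given per 100 g of food, which the source does not state, and it uses a hypothetical `toy` frame with made-up rows:

```python
import pandas as pd

# Hypothetical rows; 'Mystery bar' is deliberately inconsistent.
toy = pd.DataFrame({
    'name': ['Apple', 'Mystery bar', 'Olive oil'],
    'proteins (g)': [0.4, 30.0, 0.0],
    'carbohydrates (g)': [14.0, 60.0, 0.0],
    'fats (g)': [0.2, 25.0, 100.0],
})

macro_cols = ['proteins (g)', 'carbohydrates (g)', 'fats (g)']

# Rule 1: no macronutrient can be negative.
negative = (toy[macro_cols] < 0).any(axis=1)

# Rule 2 (assumption: values are per 100 g): the three macros
# together cannot weigh more than 100 g.
over_100 = toy[macro_cols].sum(axis=1) > 100

suspect = toy[negative | over_100]
print(suspect['name'].tolist())
```

Rows flagged this way are candidates for manual review rather than automatic removal.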


## Cleaning Data

- Drop unnecessary columns.

In [16]:
col = ['web-scraper-order', 'web-scraper-start-url', 'category-href']

df.drop(columns=col, inplace=True)

- Drop rows with a duplicate `name`, keeping the first occurrence.

In [17]:
df.drop_duplicates(subset=['name'], keep='first', inplace=True)

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 573 entries, 0 to 576
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   category           573 non-null    object 
 1   name               573 non-null    object 
 2   glycemic_index     573 non-null    int64  
 3   glycemic_load      573 non-null    float64
 4   calories (kcal)    573 non-null    int64  
 5   proteins (g)       573 non-null    float64
 6   carbohydrates (g)  573 non-null    float64
 7   fats (g)           573 non-null    float64
dtypes: float64(4), int64(2), object(2)
memory usage: 40.3+ KB
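The `Index: 573 entries, 0 to 576` line shows that dropping rows leaves gaps in the index labels. If a contiguous index is wanted later (an optional step, not part of the original notebook), it could be rebuilt as in this sketch on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical sample with one repeated name.
toy = pd.DataFrame({
    'name': ['Apple', 'Banana', 'Apple', 'Carrot'],
    'glycemic_index': [35, 60, 36, 30],
})

# Same call as in the notebook, but returning a new frame instead of inplace.
deduped = toy.drop_duplicates(subset=['name'], keep='first')
print(deduped.index.tolist())    # [0, 1, 3]: label 2 is gone

# drop=True discards the old labels instead of adding them as a column.
reindexed = deduped.reset_index(drop=True)
print(reindexed.index.tolist())  # [0, 1, 2]
```

Because the dataset is saved with `index=False`, the gaps do not reach the CSV file either way. They only matter for positional or label-based lookups made in memory.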


- Convert the integer columns (`glycemic_index` and `calories (kcal)`) to `float64` so that all numeric columns share the same data type.

In [19]:
col = ['glycemic_index', 'calories (kcal)']

df[col] = df[col].astype('float64')

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 573 entries, 0 to 576
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   category           573 non-null    object 
 1   name               573 non-null    object 
 2   glycemic_index     573 non-null    float64
 3   glycemic_load      573 non-null    float64
 4   calories (kcal)    573 non-null    float64
 5   proteins (g)       573 non-null    float64
 6   carbohydrates (g)  573 non-null    float64
 7   fats (g)           573 non-null    float64
dtypes: float64(6), object(2)
memory usage: 40.3+ KB
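The column list above is written by hand. As an alternative sketch, the integer columns could be detected from their dtypes, which avoids editing the list if the schema changes. The `toy` frame is hypothetical and only mirrors a few of the dataset's column names:

```python
import pandas as pd

# Hypothetical sample with mixed integer and float columns.
toy = pd.DataFrame({
    'name': ['Apple', 'Banana'],
    'glycemic_index': [35, 60],
    'glycemic_load': [5.0, 12.0],
    'calories (kcal)': [52, 89],
})

# Collect every column whose dtype is an integer type.
int_cols = [c for c in toy.columns if pd.api.types.is_integer_dtype(toy[c])]

# astype with a dict converts only the listed columns and returns a new frame.
converted = toy.astype({c: 'float64' for c in int_cols})
print(converted.dtypes)
```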


## Save Dataset

In [21]:
df.to_csv('nutrition food dataset - modified.csv', index=False)
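A quick way to confirm the export is to read the file back and compare its shape and dtypes. The following sketch does this with a hypothetical `toy` frame and a temporary directory, so it does not touch the real output file:

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample standing in for the cleaned dataset.
toy = pd.DataFrame({
    'name': ['Apple', 'Banana'],
    'glycemic_index': [35.0, 60.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')  # hypothetical file name
    # index=False keeps the row labels out of the file, as in the notebook.
    toy.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored.shape)
print(restored.dtypes)
```

If the file had been written without `index=False`, the reloaded frame would gain an extra `Unnamed: 0` column holding the old row labels.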