# Data Cleaning and Pre-Processing

This notebook contains the code for cleaning and pre-processing the data. The columns are converted to numeric and categorical data types, missing values (if any) are imputed, and outliers (if any) are removed. The cleaned dataset is then saved to a new CSV file.

## Importing Libraries


In [13]:
import pandas as pd

In [14]:
df = pd.read_excel('2021_Global_Nutrition_Report_Dataset_6YAamkd (1).xlsx')
df.head()

FileNotFoundError: [Errno 2] No such file or directory: '2021_Global_Nutrition_Report_Dataset_6YAamkd (1)'

Let's first look at the first and last rows of the extracted data.

In [3]:
df = pd.read_csv('recommended_nutrition.csv')
df.head(5)

Unnamed: 0,Sex,Age,Height,Weight,Activity Level,BMI,Daily Calories,Carbs,Fiber,Protein,Fat,Water,Vitamin C,Vitamin A,Vitamin D,Vitamin E,Vitamin B12,Vitamin K,Niacin,Calcium
0,Male,18 years,4 ft. 0 in.,88 lbs.,Sedentary,26.9,"1,166 kcal/day",131 - 189 grams\n,38 grams,34 grams,32 - 45 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
1,Male,18 years,4 ft. 0 in.,90 lbs.,Sedentary,27.5,"1,190 kcal/day",134 - 193 grams\n,38 grams,35 grams,33 - 46 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
2,Male,18 years,4 ft. 0 in.,93 lbs.,Sedentary,28.4,"1,227 kcal/day",138 - 199 grams\n,38 grams,36 grams,34 - 48 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
3,Male,18 years,4 ft. 0 in.,95 lbs.,Sedentary,29.0,"1,251 kcal/day",141 - 203 grams\n,38 grams,37 grams,35 - 49 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
4,Male,18 years,4 ft. 0 in.,97 lbs.,Sedentary,29.6,"1,275 kcal/day",143 - 207 grams\n,38 grams,37 grams,35 - 50 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"


In [4]:
df.tail(5)

Unnamed: 0,Sex,Age,Height,Weight,Activity Level,BMI,Daily Calories,Carbs,Fiber,Protein,Fat,Water,Vitamin C,Vitamin A,Vitamin D,Vitamin E,Vitamin B12,Vitamin K,Niacin,Calcium
70,Male,18 years,6 ft. 0 in.,132 lbs.,Sedentary,17.9,"2,249 kcal/day",253 - 365 grams\n,38 grams,51 grams,62 - 87 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
71,Male,18 years,6 ft. 0 in.,135 lbs.,Sedentary,18.3,"2,286 kcal/day",257 - 371 grams\n,38 grams,52 grams,64 - 89 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
72,Male,18 years,6 ft. 0 in.,137 lbs.,Sedentary,18.6,"2,310 kcal/day",260 - 375 grams\n,38 grams,53 grams,64 - 90 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
73,Male,18 years,6 ft. 0 in.,139 lbs.,Sedentary,18.9,"2,334 kcal/day",263 - 379 grams\n,38 grams,54 grams,65 - 91 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
74,Male,18 years,6 ft. 0 in.,141 lbs.,Sedentary,19.1,"2,358 kcal/day",265 - 383 grams\n,38 grams,54 grams,66 - 92 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"


From this first look, we can see that the dataset is not clean. Every column except BMI is stored as an object (string) and must be converted to a numeric or categorical type. Many values also embed units or extra text (for example `1,166 kcal/day`), and the column names are not in a convenient format. We will address these issues in this section.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             75 non-null     object 
 1   Age             75 non-null     object 
 2   Height          75 non-null     object 
 3   Weight          75 non-null     object 
 4   Activity Level  75 non-null     object 
 5   BMI             75 non-null     float64
 6   Daily Calories  75 non-null     object 
 7   Carbs           75 non-null     object 
 8   Fiber           75 non-null     object 
 9   Protein         75 non-null     object 
 10  Fat             75 non-null     object 
 11  Water           75 non-null     object 
 12  Vitamin C       75 non-null     object 
 13  Vitamin A       75 non-null     object 
 14  Vitamin D       75 non-null     object 
 15  Vitamin E       75 non-null     object 
 16  Vitamin B12     75 non-null     object 
 17  Vitamin K       75 non-null     object 

As expected, every column except BMI is of dtype object and must be preprocessed into a numeric or categorical type.

In [6]:
df.isnull().sum()

Sex               0
Age               0
Height            0
Weight            0
Activity Level    0
BMI               0
Daily Calories    0
Carbs             0
Fiber             0
Protein           0
Fat               0
Water             0
Vitamin C         0
Vitamin A         0
Vitamin D         0
Vitamin E         0
Vitamin B12       0
Vitamin K         0
Niacin            0
Calcium           0
dtype: int64

Luckily, the dataset does not have any missing values, so no imputation is needed.

## Converting object columns to numeric

First we will convert the most important columns from object to numeric, so that they can be used as inputs to the model. The features that we will convert are:
- Age
- Height
- Weight
- Daily Calories

In [7]:
# First, remove the thousands separator (comma) from Daily Calories, e.g. "1,166" -> "1166"
df['Daily Calories'] = df['Daily Calories'].str.replace(',', '', regex=False)

# Then keep only the first run of digits in each column and drop the remaining text.
# Note: for Height ("4 ft. 0 in.") this keeps only the feet and discards the inches.
numeric_columns = ['Age', 'Height', 'Weight', 'Daily Calories']
for col in numeric_columns:
    df[col] = df[col].str.extract(r'(\d+)', expand=False).astype(float)
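
The loop above keeps only the first number in each string, so a Height such as `4 ft. 0 in.` becomes 4.0 and the inches are lost. A possible alternative, shown here only as a sketch on a small hand-made sample, is to capture feet and inches separately and combine them into total inches. The names `heights`, `parts` and `height_in` are illustrative and not part of the notebook, and the pattern assumes every value follows the `N ft. M in.` format seen in the preview.

```python
import pandas as pd

# Sample values in the same format as the raw Height column (illustrative data)
heights = pd.Series(["4 ft. 0 in.", "5 ft. 7 in.", "6 ft. 0 in."])

# Capture feet and inches as two named groups; rows that do not match become NaN
parts = heights.str.extract(
    r"(?P<feet>\d+)\s*ft\.\s*(?P<inches>\d+)\s*in\."
).astype(float)

# Combine both components into a single height in inches
height_in = parts["feet"] * 12 + parts["inches"]
print(height_in.tolist())
```

In the notebook this would have to run on the raw string column, before the loop overwrites `df['Height']` with the feet-only value.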


In [10]:
df['Sex'] = df['Sex'].map({'Female': 0, 'Male': 1})
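
Activity Level is the other text column that describes a category rather than a quantity. One way it might be handled, sketched here on a toy series, is to convert it to the pandas `category` dtype and use the integer codes. Only `Sedentary` appears in the preview above; the `Active` value below is a hypothetical second level added purely for illustration.

```python
import pandas as pd

# Toy data: "Sedentary" comes from the dataset preview, "Active" is hypothetical
activity = pd.Series(["Sedentary", "Active", "Sedentary"])

# Convert to a categorical dtype; categories are sorted alphabetically by default
activity_cat = activity.astype("category")

# Integer codes index into activity_cat.cat.categories
codes = activity_cat.cat.codes
print(list(activity_cat.cat.categories), codes.tolist())
```

The codes follow alphabetical order, not intensity, so an explicit ordering would be needed if the model should treat the levels as ordinal.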

In [11]:
df.head(5)

Unnamed: 0,Sex,Age,Height,Weight,Activity Level,BMI,Daily Calories,Carbs,Fiber,Protein,Fat,Water,Vitamin C,Vitamin A,Vitamin D,Vitamin E,Vitamin B12,Vitamin K,Niacin,Calcium
0,1,18.0,4.0,88.0,Sedentary,26.9,1166.0,131 - 189 grams\n,38 grams,34 grams,32 - 45 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
1,1,18.0,4.0,90.0,Sedentary,27.5,1190.0,134 - 193 grams\n,38 grams,35 grams,33 - 46 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
2,1,18.0,4.0,93.0,Sedentary,28.4,1227.0,138 - 199 grams\n,38 grams,36 grams,34 - 48 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
3,1,18.0,4.0,95.0,Sedentary,29.0,1251.0,141 - 203 grams\n,38 grams,37 grams,35 - 49 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"
4,1,18.0,4.0,97.0,Sedentary,29.6,1275.0,143 - 207 grams\n,38 grams,37 grams,35 - 50 grams\n,3.3 liters (about 14 cups)\n,75 mg,900 mcg,15 mcg,15 mg,2.4 mcg,75 mcg,16 mg,"1,300 mg"


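We can see that Age, Height, Weight and Daily Calories are now numeric, and Sex is encoded as 0/1. Note that Height holds only the feet component, because the regex captures just the first number in each string. We can now proceed to the next step, where we create a sample model that predicts how many calories a person needs per day based on their features.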

## TODO: Convert the rest of the columns
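
The remaining columns fall into two shapes in the preview: ranges such as `131 - 189 grams` and single values with a unit such as `75 mg` or `1,300 mg`. The following is a rough sketch of how they might be converted, run on a small sample built from the values shown earlier. The helper names `first_number` and `split_range`, and the `Carbs Min` / `Carbs Max` column names, are assumptions made for this example.

```python
import pandas as pd

# Small sample mirroring the raw formats seen in the preview
sample = pd.DataFrame({
    "Carbs": ["131 - 189 grams\n", "134 - 193 grams\n"],
    "Water": ["3.3 liters (about 14 cups)\n", "3.3 liters (about 14 cups)\n"],
    "Calcium": ["1,300 mg", "1,300 mg"],
})

def first_number(series):
    """Return the first (possibly decimal) number in each string as a float."""
    cleaned = series.str.replace(",", "", regex=False)  # drop thousands separators
    return cleaned.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

def split_range(series):
    """Split 'low - high unit' strings into two float columns, 'min' and 'max'."""
    bounds = series.str.extract(r"(\d+)\s*-\s*(\d+)").astype(float)
    bounds.columns = ["min", "max"]
    return bounds

# Range column: keep both bounds as separate numeric columns
carbs = split_range(sample["Carbs"])
sample["Carbs Min"] = carbs["min"]
sample["Carbs Max"] = carbs["max"]

# Single-value columns: keep the leading number and drop the unit text
sample["Water"] = first_number(sample["Water"])
sample["Calcium"] = first_number(sample["Calcium"])
print(sample.drop(columns="Carbs"))
```

Keeping both bounds of a range avoids choosing a midpoint up front. Dropping the unit text assumes each column uses a single unit throughout, which should be checked before applying this to the full dataset.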

## Save the dataset

In [12]:
df.to_csv('recommended_nutrition_cleaned.csv', index=False)