### **Data Validation**


This dataset contains **947 rows** and **8 columns**. Validation produced the following observations and cleaning steps for each column:

**1. recipe**: 947 unique numeric values without missing values, as described. No cleaning is needed.

**2. calories**: Numeric column with 52 missing values. Missing values filled with the median.

**3. carbohydrate**: Numeric column with 52 missing values. Missing values filled with the median.

**4. sugar**: Numeric column with 52 missing values. Missing values filled with the median.

**5. protein**: Numeric column with 52 missing values. Missing values filled with the median.

**6. category**: 11 unique categories without missing values. "Chicken Breast" was merged into "Chicken", leaving 10 categories.

**7. servings**: Originally stored as object. No missing values. The numeric part of each value was extracted and the column converted to integer.

**8. high_traffic**: Contains "High" and 373 missing values, which indicate low traffic. Missing values replaced with "Low".
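
The four nutrition columns receive the same treatment, so the median fill can also be written once for all of them. The following is a minimal sketch on a toy frame, not the notebook's own cells; `toy`, `nutrition_cols`, and `missing_before` are illustrative names and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the recipe data (illustrative values only)
toy = pd.DataFrame({
    "calories": [35.48, np.nan, 914.28, 97.03],
    "carbohydrate": [38.56, 42.68, np.nan, 30.56],
    "sugar": [0.66, 3.09, 38.63, np.nan],
    "protein": [np.nan, 0.92, 2.88, 0.02],
})

nutrition_cols = ["calories", "carbohydrate", "sugar", "protein"]

# Record the missing counts before imputation so the step stays auditable
missing_before = toy[nutrition_cols].isnull().sum()

# Per-column medians are computed from the observed (non-missing) values
medians = toy[nutrition_cols].median()

# Passing a Series to fillna fills each column with its own median
toy[nutrition_cols] = toy[nutrition_cols].fillna(medians)

print("Missing before imputation:")
print(missing_before)
print("Missing after imputation:")
print(toy[nutrition_cols].isnull().sum())
```

Capturing the counts before filling matters in a notebook: if a validation cell is re-run after imputation, it reports 0 missing values and the original count is no longer visible.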


In [36]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [37]:
# Load dataset
recipe_data = pd.read_csv("../data/raw/recipe_site_traffic.csv")

# Display dataset information
recipe_data.info()

# Check missing values per column
print("Missing values per column:")
print(recipe_data.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 947 entries, 0 to 946
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   recipe        947 non-null    int64  
 1   calories      895 non-null    float64
 2   carbohydrate  895 non-null    float64
 3   sugar         895 non-null    float64
 4   protein       895 non-null    float64
 5   category      947 non-null    object 
 6   servings      947 non-null    object 
 7   high_traffic  574 non-null    object 
dtypes: float64(4), int64(1), object(3)
memory usage: 59.3+ KB
Missing values per column:
recipe            0
calories         52
carbohydrate     52
sugar            52
protein          52
category          0
servings          0
high_traffic    373
dtype: int64


In [38]:
recipe_data.head()

   recipe  calories  carbohydrate  sugar  protein   category  servings  high_traffic
0       1       NaN           NaN    NaN      NaN       Pork         6          High
1       2     35.48         38.56   0.66     0.92     Potato         4          High
2       3    914.28         42.68   3.09     2.88  Breakfast         1           NaN
3       4     97.03         30.56  38.63     0.02  Beverages         4          High
4       5     27.05          1.85   0.80     0.53  Beverages         4           NaN


In [33]:
# Check data type of 'recipe' column
print("Data type of 'recipe':", recipe_data['recipe'].dtype)

# Check for missing values
print("Missing values in 'recipe':", recipe_data['recipe'].isnull().sum())

# Check number of unique IDs
unique_recipes = recipe_data['recipe'].nunique()
print("Unique 'recipe' IDs:", unique_recipes)

# Check min and max (should be a unique ID range)
print("Min recipe ID:", recipe_data['recipe'].min())
print("Max recipe ID:", recipe_data['recipe'].max())

Data type of 'recipe': int64
Missing values in 'recipe': 0
Unique 'recipe' IDs: 947
Min recipe ID: 1
Max recipe ID: 947
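
The individual checks above (no missing values, 947 unique values, range 1 to 947) can be bundled into one reusable test. This is a sketch under the assumption that the column should behave like a gap-free integer ID; `check_id_column` and `example_ids` are hypothetical names, not part of the notebook.

```python
import pandas as pd


def check_id_column(ids: pd.Series) -> dict:
    """Summarise whether a column behaves like a unique, gap-free integer ID."""
    is_unique = bool(ids.is_unique)
    return {
        "no_missing": bool(ids.notnull().all()),
        "is_unique": is_unique,
        # For unique integers, a span equal to the row count means no gaps
        "is_contiguous": bool(is_unique and ids.max() - ids.min() + 1 == len(ids)),
    }


# Stand-in for recipe_data['recipe']: integers 1..947
example_ids = pd.Series(range(1, 948), name="recipe")
print(check_id_column(example_ids))
```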


In [41]:
# Check data type of 'calories' column
print("Data type of 'calories':", recipe_data['calories'].dtype)

# Missing values
print("Missing values in 'calories':", recipe_data['calories'].isnull().sum())

# Basic statistics
print("Calories summary statistics:")
print(recipe_data['calories'].describe())

# Check for negative values, which would not be valid calorie counts
negative_calories = (recipe_data['calories'] < 0).sum()
print("Number of negative values in 'calories':", negative_calories)

Data type of 'calories': float64
Missing values in 'calories': 0
Calories summary statistics:
count     947.000000
mean      427.846019
std       441.673556
min         0.140000
25%       114.410000
50%       288.550000
75%       576.225000
max      3633.160000
Name: calories, dtype: float64
Number of negative values in 'calories': 0


In [42]:
# Calculate the median of the 'calories' column
median_calories = recipe_data['calories'].median()

# Fill missing values with the median
recipe_data['calories'] = recipe_data['calories'].fillna(median_calories)

# Verify that there are no missing values in the 'calories' column now
print("Missing values in 'calories' after imputation:", recipe_data['calories'].isnull().sum())

Missing values in 'calories' after imputation: 0


In [None]:
# check Data type of 'carbohydrate' column
print("=== Validation of 'carbohydrate' Column ===")
# Display the data type
print("Data type of 'carbohydrate':", recipe_data['carbohydrate'].dtype)

# Check for missing values
missing_carbs = recipe_data['carbohydrate'].isnull().sum()
print("Missing values in 'carbohydrate':", missing_carbs)

# Display summary statistics
print("Carbohydrate Summary Statistics:")
print(recipe_data['carbohydrate'].describe())

# Check for any negative values 
negative_carbs = (recipe_data['carbohydrate'] < 0).sum()
print("Number of negative values in 'carbohydrate':", negative_carbs)

# Compute the median of the 'carbohydrate' column
median_carbohydrate = recipe_data['carbohydrate'].median()
print("Median of 'carbohydrate':", median_carbohydrate)

# Fill missing values with the median value
recipe_data['carbohydrate'] = recipe_data['carbohydrate'].fillna(median_carbohydrate)

# Check that missing values have been handled
print("Missing values in 'carbohydrate' after imputation:", recipe_data['carbohydrate'].isnull().sum())

unique_carbs = recipe_data['carbohydrate'].unique()
print("Unique values in 'carbohydrate' after imputation:", unique_carbs)

=== Validation of 'carbohydrate' Column ===
Data type of 'carbohydrate': float64
Missing values in 'carbohydrate': 0
Carbohydrate Summary Statistics:
count    947.000000
mean      34.323464
std       42.836191
min        0.030000
25%        9.135000
50%       21.480000
75%       42.590000
max      530.420000
Name: carbohydrate, dtype: float64
Number of negative values in 'carbohydrate': 0
Median of 'carbohydrate': 21.48
Missing values in 'carbohydrate' after imputation: 0
Unique values in 'carbohydrate' after imputation: [2.1480e+01 3.8560e+01 4.2680e+01 3.0560e+01 1.8500e+00 3.4600e+00
 4.7950e+01 3.1700e+00 3.7800e+00 4.8540e+01 1.7630e+01 8.2700e+00
 2.3490e+01 1.1510e+01 6.6900e+00 2.6500e+00 1.8700e+00 1.0000e-01
 4.6500e+00 2.7550e+01 1.7440e+01 8.7910e+01 1.5300e+00 2.2350e+01
 5.1700e+01 1.3120e+01 6.2670e+01 3.3580e+01 5.2660e+01 2.3100e+01
 9.5000e+00 1.4700e+00 2.0710e+01 2.9100e+01 7.0070e+01 9.9820e+01
 1.5000e+00 4.6200e+00 1.4160e+01 4.4300e+00 4.7900e+00 1.7460e+01
 1.6

In [46]:
# check Data type of 'sugar' column
print("=== Validation of 'sugar' Column ===")
# Display the data type of 'sugar'
print("Data type of 'sugar':", recipe_data['sugar'].dtype)

# Check for missing values in 'sugar'
missing_sugar = recipe_data['sugar'].isnull().sum()
print("Missing values in 'sugar':", missing_sugar)

# Display summary statistics for 'sugar'
print("Sugar Summary Statistics:")
print(recipe_data['sugar'].describe())

# Check for any negative values in 'sugar'
negative_sugar = (recipe_data['sugar'] < 0).sum()
print("Number of negative values in 'sugar':", negative_sugar)

# Compute the median of the 'sugar' column
median_sugar = recipe_data['sugar'].median()
print("Median of 'sugar':", median_sugar)

# Fill missing values with the median value
recipe_data['sugar'] = recipe_data['sugar'].fillna(median_sugar)

# Check that missing values have been handled
print("Missing values in 'sugar' after imputation:", recipe_data['sugar'].isnull().sum())

unique_sugar = recipe_data['sugar'].unique()
print("Unique values in 'sugar' after imputation:", unique_sugar)

=== Validation of 'sugar' Column ===
Data type of 'sugar': float64
Missing values in 'sugar': 0
Sugar Summary Statistics:
count    947.000000
mean       8.799641
std       14.306785
min        0.010000
25%        1.795000
50%        4.550000
75%        9.285000
max      148.750000
Name: sugar, dtype: float64
Number of negative values in 'sugar': 0
Median of 'sugar': 4.55
Missing values in 'sugar' after imputation: 0
Unique values in 'sugar' after imputation: [4.5500e+00 6.6000e-01 3.0900e+00 3.8630e+01 8.0000e-01 1.6500e+00
 9.7500e+00 4.0000e-01 3.3700e+00 3.9900e+00 4.1000e+00 9.7800e+00
 1.5600e+00 1.0320e+01 1.0000e+01 4.6800e+00 2.9500e+00 3.9000e-01
 6.9000e-01 1.5100e+00 8.1600e+00 1.0491e+02 7.9500e+00 8.8800e+00
 1.1380e+01 2.7780e+01 1.8400e+00 2.6400e+00 1.7870e+01 6.2500e+00
 3.2830e+01 5.9200e+00 2.0000e-01 9.6300e+00 7.7500e+00 2.6200e+00
 1.8440e+01 1.0700e+01 1.0500e+00 2.0920e+01 3.3000e-01 7.7000e-01
 1.6200e+00 3.9540e+01 4.5800e+00 7.0000e-02 9.4000e+00 1.0800e+00
 

In [None]:
# Check data type of 'protein' column
print("=== Validation of 'protein' Column ===")

# Display the data type of 'protein'
print("Data type of 'protein':", recipe_data['protein'].dtype)

# Check for missing values in 'protein'
missing_protein = recipe_data['protein'].isnull().sum()
print("Missing values in 'protein':", missing_protein)

# Display summary statistics for 'protein'
print("Protein Summary Statistics:")
print(recipe_data['protein'].describe())

# Check for any negative values in 'protein' (assuming negative values are not valid)
negative_protein = (recipe_data['protein'] < 0).sum()
print("Number of negative values in 'protein':", negative_protein)

# Compute the median of the 'protein' column
median_protein = recipe_data['protein'].median()
print("Median of 'protein':", median_protein)

# Fill missing values with the median value
recipe_data['protein'] = recipe_data['protein'].fillna(median_protein)

# Check that missing values have been handled
print("Missing values in 'protein' after imputation:", recipe_data['protein'].isnull().sum())

unique_protein = recipe_data['protein'].unique()
print("Unique values in 'protein' after imputation:", unique_protein)

=== Validation of 'protein' Column ===
Data type of 'protein': float64
Missing values in 'protein': 52
Protein Summary Statistics:
count    895.000000
mean      24.149296
std       36.369739
min        0.000000
25%        3.195000
50%       10.800000
75%       30.200000
max      363.360000
Name: protein, dtype: float64
Number of negative values in 'protein': 0
Median of 'protein': 10.8
Missing values in 'protein' after imputation: 0
Unique values in 'protein' after imputation: [1.0800e+01 9.2000e-01 2.8800e+00 2.0000e-02 5.3000e-01 5.3930e+01
 4.6710e+01 3.2400e+01 3.7900e+00 1.1385e+02 9.1000e-01 1.1550e+01
 2.5700e+00 9.5700e+00 1.5170e+01 7.9710e+01 6.1070e+01 3.3170e+01
 3.4900e+00 8.9100e+00 1.0810e+01 1.1930e+01 2.6040e+01 1.2570e+01
 3.4790e+01 7.0300e+01 1.3850e+01 4.9600e+00 2.2014e+02 3.2320e+01
 4.5890e+01 8.2580e+01 2.9700e+00 6.2400e+00 2.2800e+00 1.9510e+01
 1.5570e+01 3.2620e+01 5.9000e+00 3.9690e+01 4.0640e+01 4.2900e+00
 8.7050e+01 1.1200e+01 3.4400e+00 1.7000e-01 7.92

In [50]:
# Check data type of 'category' column
print("=== Validation of 'category' Column ===")

# 1. Convert "Chicken Breast" to "Chicken"
recipe_data['category'] = recipe_data['category'].replace("Chicken Breast", "Chicken")

# 2. Check data type of 'category'
print("Data type of 'category':", recipe_data['category'].dtype)

# 3. Check for missing values in 'category'
print("Missing values in 'category':", recipe_data['category'].isnull().sum())

# 4. Display unique values in 'category'
unique_categories = recipe_data['category'].unique()
print("Unique categories in 'category':", unique_categories)

# 5. Count frequency of each category to understand distribution
print("Value counts for 'category':")
print(recipe_data['category'].value_counts())

# 6. Convert 'category' to string type if not already 
recipe_data['category'] = recipe_data['category'].astype(str)
print(recipe_data['category'].head())
print("Data type after conversion:", recipe_data['category'].dtype)

=== Validation of 'category' Column ===
Data type of 'category': object
Missing values in 'category': 0
Unique categories in 'category': ['Pork' 'Potato' 'Breakfast' 'Beverages' 'One Dish Meal' 'Chicken'
 'Lunch/Snacks' 'Vegetable' 'Meat' 'Dessert']
Value counts for 'category':
category
Chicken          172
Breakfast        106
Beverages         92
Lunch/Snacks      89
Potato            88
Pork              84
Vegetable         83
Dessert           83
Meat              79
One Dish Meal     71
Name: count, dtype: int64
0         Pork
1       Potato
2    Breakfast
3    Beverages
4    Beverages
Name: category, dtype: object
Data type after conversion: object
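
One way to confirm that the merge left only the intended labels is to compare the column against an explicit set. The sketch below uses the 10 categories printed above as the expected set; `EXPECTED_CATEGORIES`, `unexpected_categories`, and the toy Series are hypothetical names introduced for illustration.

```python
import pandas as pd

# The 10 categories shown in the output above, taken as the expected set
EXPECTED_CATEGORIES = {
    "Pork", "Potato", "Breakfast", "Beverages", "One Dish Meal",
    "Chicken", "Lunch/Snacks", "Vegetable", "Meat", "Dessert",
}


def unexpected_categories(categories: pd.Series) -> set:
    """Return the labels present in the column but absent from the expected set."""
    return set(categories.dropna().unique()) - EXPECTED_CATEGORIES


# Toy column containing the label that the notebook merges into "Chicken"
toy_categories = pd.Series(["Pork", "Chicken Breast", "Chicken", "Dessert"])

before = unexpected_categories(toy_categories)
after = unexpected_categories(toy_categories.replace("Chicken Breast", "Chicken"))

print("Unexpected before merge:", before)
print("Unexpected after merge:", after)
```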


In [51]:
# Convert the 'servings' column to string (in case there are any mixed values)
recipe_data['servings'] = recipe_data['servings'].astype(str)

# Extract the numeric part using a regex (this keeps only the digits)
recipe_data['servings'] = recipe_data['servings'].str.extract(r'(\d+)', expand=False)

# Convert the cleaned 'servings' column to integer
recipe_data['servings'] = pd.to_numeric(recipe_data['servings'], errors='coerce').astype(int)

# Verify the cleaning by checking unique values
print("Unique servings values after cleaning:", recipe_data['servings'].unique())

Unique servings values after cleaning: [6 4 1 2]
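
To make the extraction step concrete, here is a small sketch on made-up input. The raw strings are hypothetical, since the notebook does not show which non-numeric values occur in `servings`. Counting unparsed values before the integer cast is an added safeguard, because `astype(int)` raises an error if any value was coerced to NaN.

```python
import pandas as pd

# Hypothetical raw values: plain digits plus one entry with trailing text
raw_servings = pd.Series(["6", "4 as a snack", "1", "2"], name="servings")

# Keep only the first run of digits in each string
digits = raw_servings.str.extract(r"(\d+)", expand=False)

# Convert to numbers; anything without digits becomes NaN
servings = pd.to_numeric(digits, errors="coerce")

# Check for unparsed values before casting, since astype(int) fails on NaN
unparsed = int(servings.isnull().sum())
print("Values that could not be parsed:", unparsed)

servings = servings.astype(int)
print("Cleaned servings:", servings.tolist())
```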


In [53]:
# Check data type of 'high_traffic' column
print("=== High Traffic Column - Initial Validation ===")
# Check the data type
print("Data type of 'high_traffic':", recipe_data['high_traffic'].dtype)

# Check for missing values
print("Missing values in 'high_traffic':", recipe_data['high_traffic'].isnull().sum())

# Display unique values and their frequency counts
print("Unique values in 'high_traffic':", recipe_data['high_traffic'].unique())
print("Value counts in 'high_traffic':")
print(recipe_data['high_traffic'].value_counts())

# Replace NaN values with 'Low' (capitalised to match the existing 'High' label)
recipe_data['high_traffic'] = recipe_data['high_traffic'].fillna('Low')

# Validation after cleaning
print("\n=== Cleaned High Traffic Column ===")
print("Missing values in 'high_traffic' after conversion:", recipe_data['high_traffic'].isnull().sum())
print("Unique values in 'high_traffic' after conversion:", recipe_data['high_traffic'].unique())
print("Value counts for 'high_traffic' after conversion:")
print(recipe_data['high_traffic'].value_counts())

=== High Traffic Column - Initial Validation ===
Data type of 'high_traffic': object
Missing values in 'high_traffic': 0
Unique values in 'high_traffic': ['High' 'Low']
Value counts in 'high_traffic':
high_traffic
High    574
Low     373
Name: count, dtype: int64

=== Cleaned High Traffic Column ===
Missing values in 'high_traffic' after conversion: 0
Unique values in 'high_traffic' after conversion: ['High' 'Low']
Value counts for 'high_traffic' after conversion:
high_traffic
High    574
Low     373
Name: count, dtype: int64
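
A simple consistency check for this step is that the number of "Low" labels after the fill equals the number of missing values before it (373 in this dataset). The sketch below illustrates the idea on a toy Series; `toy_traffic`, `n_missing`, `cleaned`, and `counts` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy column: "High" where traffic was high, NaN otherwise
toy_traffic = pd.Series(["High", np.nan, "High", np.nan, np.nan], name="high_traffic")

# Capture the missing count before the fill; it is not recoverable afterwards
n_missing = int(toy_traffic.isnull().sum())

cleaned = toy_traffic.fillna("Low")
counts = cleaned.value_counts()

# Every filled value should appear as "Low", and nothing else should change
assert counts["Low"] == n_missing
print(counts)
```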


In [55]:
# Save cleaned dataset for modeling
recipe_data.to_csv("../data/processed/cleaned_recipes.csv", index=False)

print("Cleaned dataset saved to data/processed/cleaned_recipes.csv")

Cleaned dataset saved to data/processed/cleaned_recipes.csv
