# Food Nutrition Dataset: Data Preprocessing & EDA Report

This notebook presents a comprehensive workflow for data preprocessing and exploratory data analysis (EDA) on a food nutrition dataset. The steps include data cleaning, visualization, feature engineering, and feature importance analysis to prepare the data for modeling.

## 1. Importing Required Libraries

We begin by importing essential libraries for data manipulation, visualization, and preprocessing.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

## 2. Loading the Dataset

Load the food nutrition dataset and display its first few rows and shape.

In [None]:
# Replace 'food_data.csv' with your actual file path if needed
df = pd.read_csv("food_data.csv")
df.head()

In [None]:
df.shape

## 3. Initial Data Exploration

Get an overview of the dataset structure and data types.

In [None]:
df.info()

## 4. Descriptive Statistics

Generate descriptive statistics for both numerical and categorical columns.

In [None]:
# Numerical columns
df.describe()

In [None]:
# Categorical columns (object and boolean dtypes only)
df.describe(include=['object', 'bool'])

## 5. Data Visualization

Visualize average calories by meal type and explore distributions of continuous and categorical variables.

In [None]:
plt.figure(figsize=(10,4))
sns.barplot(x=df["Meal_Type"], y=df["Calories"])
plt.title("Average Calories by Meal Type")
plt.xlabel("Meal Type")
plt.ylabel("Average Calories")
plt.xticks(rotation=45)
plt.show()
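
The bar heights in this plot are group means, because `sns.barplot` aggregates with the mean by default. As a minimal sketch, the same aggregation can be produced as a table with `groupby`. The rows below are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for df; the rows are invented for illustration only
toy = pd.DataFrame({
    "Meal_Type": ["Breakfast", "Lunch", "Lunch", "Dinner", "Breakfast"],
    "Calories": [300.0, 550.0, 650.0, 700.0, 400.0],
})

# Mean calories per meal type, sorted from highest to lowest
avg_calories = (
    toy.groupby("Meal_Type")["Calories"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_calories)
```

Having the numbers next to the chart makes it easier to quote exact averages in a report.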

In [None]:
# Distribution of continuous variables
for col in df.select_dtypes(include=[np.number]).columns:
    plt.figure(figsize=(8, 4))
    sns.kdeplot(x=df[col], fill=True)
    plt.title(f'Distribution of {col}')
    plt.show()

In [None]:
# Distribution of categorical variables
for col in df.select_dtypes(include=['object', 'bool']).columns:
    plt.figure(figsize=(6, 4))
    sns.countplot(x=df[col])
    plt.title(f'Count of {col}')
    plt.xticks(rotation=45)
    plt.show()

## 6. Null Values Treatment

Identify missing values, calculate their percentage, and drop rows with missing values if the proportion is small.

In [None]:
# Count of null values
print(df.isnull().sum())
# Percentage of null values
print(df.isnull().mean() * 100)

In [None]:
# Drop every row that contains a missing value (reasonable only when the percentages above are small)
df.dropna(inplace=True)
df.info()
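
The "proportion is small" condition can also be checked in code before any rows are dropped. The sketch below assumes a hypothetical cutoff of 5% of rows (`MAX_MISSING_ROW_SHARE` is not part of the original workflow) and uses an invented toy frame.

```python
import numpy as np
import pandas as pd

# Toy frame with one incomplete row out of 50; values are invented
toy = pd.DataFrame({
    "Calories": np.arange(50, dtype=float),
    "Meal_Type": ["Lunch"] * 50,
})
toy.loc[3, "Calories"] = np.nan

MAX_MISSING_ROW_SHARE = 0.05  # hypothetical cutoff, tune per project

# Share of rows that contain at least one missing value
missing_row_share = toy.isnull().any(axis=1).mean()

if missing_row_share <= MAX_MISSING_ROW_SHARE:
    cleaned = toy.dropna()
else:
    # Too much data would be lost; keep the rows and consider imputation instead
    cleaned = toy.copy()

print(f"{missing_row_share:.1%} of rows incomplete, {len(cleaned)} rows kept")
```

Measuring the share of affected rows, rather than per-column percentages, reflects what `dropna()` actually removes.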

## 7. Categorical and Numerical Variable Identification

Identify and print the names and counts of categorical and numerical variables.

In [None]:
# np.number matches every numeric dtype, consistent with the selection used in Section 5
num_vars = df.select_dtypes(include=np.number).columns
cat_vars = df.select_dtypes(include=['object', 'bool']).columns
print("Numerical variables:", list(num_vars), "Count:", len(num_vars))
print("Categorical variables:", list(cat_vars), "Count:", len(cat_vars))

## 8. Distribution Analysis

Re-plot the distributions of continuous and categorical variables, now that rows with missing values have been removed.

In [None]:
# Continuous variables
for col in num_vars:
    plt.figure(figsize=(8, 4))
    sns.kdeplot(x=df[col], fill=True)
    plt.title(f'Distribution of {col}')
    plt.show()

In [None]:
# Categorical variables
for col in cat_vars:
    plt.figure(figsize=(6, 4))
    sns.countplot(x=df[col])
    plt.title(f'Count of {col}')
    plt.xticks(rotation=45)
    plt.show()

## 9. Outlier Detection and Treatment

Detect outliers in numerical features using boxplots, and cap outliers using the IQR method.

In [None]:
# Boxplots for outlier detection
for col in num_vars:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col], color='red')
    plt.title(f'Boxplot of {col}')
    plt.show()

In [None]:
# Capping outliers using IQR
def cap_outliers(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return series.clip(lower, upper)

df[num_vars] = df[num_vars].apply(cap_outliers)
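
To make the capping rule concrete, here is a small worked example on invented values. With the data below, Q1 = 2 and Q3 = 4, so IQR = 2 and the bounds are 2 - 1.5 × 2 = -1 and 4 + 1.5 × 2 = 7.

```python
import pandas as pd

def cap_outliers(series):
    """Clip values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Invented values: only 100.0 lies outside the bounds [-1, 7]
calories = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers(calories)
print(capped.tolist())  # [1.0, 2.0, 3.0, 4.0, 7.0]
```

Capping keeps the row count unchanged, unlike dropping outlier rows. The extreme value is pulled to the upper bound instead of being removed.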

## 10. Duplicate Entry Removal

Check for duplicate rows and remove them from the dataset.

In [None]:
print("Number of duplicate rows:", df.duplicated().sum())
df.drop_duplicates(inplace=True)
print("Number of duplicate rows after removal:", df.duplicated().sum())

## 11. Standardizing Numerical Features

Apply StandardScaler to numerical features to standardize them for further analysis or modeling.

In [None]:
scaler = StandardScaler()
df[num_vars] = scaler.fit_transform(df[num_vars])
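
`StandardScaler` replaces each value with its z-score, `(x - mean) / std`, where `std` is the population standard deviation (`ddof=0`). The sketch below checks that equivalence on an invented three-row column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented values for illustration
toy = pd.DataFrame({"Calories": [100.0, 200.0, 300.0]})

scaled = StandardScaler().fit_transform(toy[["Calories"]])

# Manual z-score with the population standard deviation (ddof=0)
manual = (toy["Calories"] - toy["Calories"].mean()) / toy["Calories"].std(ddof=0)

print(np.allclose(scaled.ravel(), manual.to_numpy()))  # True
```

If the data is later split into training and test sets, a common practice is to fit the scaler on the training portion only and reuse it to transform the test portion.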

## 12. Categorical Variable Encoding

Encode boolean categorical variables as 0/1 and apply label encoding to other categorical variables.

In [None]:
# Boolean encoding
for col in df.select_dtypes(include='bool').columns:
    df[col] = df[col].astype(int)

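# Label encoding for object-type categorical variables.
# Keep one fitted encoder per column so the codes can be decoded later.
encoders = {}
for col in df.select_dtypes(include='object').columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])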
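
Label encoding maps each distinct string to an integer code, assigned in sorted order of the labels. A small sketch with invented meal types shows the mapping and how a fitted encoder decodes the integers again.

```python
from sklearn.preprocessing import LabelEncoder

meal_types = ["Lunch", "Breakfast", "Dinner", "Lunch"]  # invented values

encoder = LabelEncoder()
codes = encoder.fit_transform(meal_types)

print(list(encoder.classes_))  # ['Breakfast', 'Dinner', 'Lunch']
print(codes.tolist())          # [2, 0, 1, 2]

# Decode the integers back to the original labels
decoded = list(encoder.inverse_transform(codes))
print(decoded)
```

The codes carry no real ordering ("Lunch" is not greater than "Dinner"). Tree-based models usually tolerate this, while linear or distance-based models may be better served by one-hot encoding.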

## 13. Feature Engineering

Engineer new features or transform existing ones as needed for modeling.

In [None]:
# Example: No new features engineered in this workflow, but this is where you would add them.
# df['New_Feature'] = df['Some_Column'] * 2
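
As one possible illustration, a ratio feature such as the share of calories that comes from protein could be added here. The `Protein` column (grams per serving) is a hypothetical name that may not exist in the dataset, and the rows are invented. The factor of 4 kcal per gram is the general Atwater value for protein.

```python
import pandas as pd

# Invented rows; 'Protein' is a hypothetical column name (grams per serving)
toy = pd.DataFrame({
    "Calories": [200.0, 400.0, 0.0],
    "Protein": [10.0, 30.0, 0.0],
})

KCAL_PER_G_PROTEIN = 4  # general Atwater factor for protein

# Mask zero-calorie rows so the division yields NaN instead of inf or 0/0
nonzero_calories = toy["Calories"].where(toy["Calories"] != 0)

# Share of total calories contributed by protein
toy["Protein_Calorie_Share"] = KCAL_PER_G_PROTEIN * toy["Protein"] / nonzero_calories
print(toy)
```

A ratio like this is only meaningful on raw values, so in this workflow it would need to be computed before the standardization step in Section 11.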

## 14. Feature Importance Analysis

Use RandomForestClassifier to determine feature importances, visualize the top features, and select the most relevant ones for modeling.

In [None]:
# Assuming 'Food_Name' is the target variable for demonstration
X = df.drop(columns=['Food_Name'])
y = df['Food_Name']

rf = RandomForestClassifier(n_estimators=250, random_state=42)
rf.fit(X, y)
importances = rf.feature_importances_
feature_names = X.columns

# Create a DataFrame for visualization
feat_imp_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances * 100
}).sort_values('Importance', ascending=False)

In [None]:
# Visualize top features
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feat_imp_df.head(9), palette='viridis')
plt.title('Top 9 Feature Importance Scores')
plt.xlabel('Importance Score (%)')
plt.ylabel('Features')
plt.tight_layout()
plt.show()
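
The selection step itself can be sketched as a cumulative-importance cutoff: keep the highest-ranked features until they account for a chosen share of total importance. The 80% target, the scores, and every feature name other than `Calories` and `Meal_Type` are assumptions made for illustration.

```python
import pandas as pd

# Invented importance scores in percent; 'Protein', 'Fat', and 'Sodium'
# are hypothetical feature names
feat_imp_toy = pd.DataFrame({
    "Feature": ["Calories", "Protein", "Fat", "Meal_Type", "Sodium"],
    "Importance": [40.0, 25.0, 20.0, 10.0, 5.0],
}).sort_values("Importance", ascending=False)

COVERAGE = 80.0  # hypothetical target: cover 80% of total importance

# Running total of importance, most important feature first
cumulative = feat_imp_toy["Importance"].cumsum()

# Keep a feature while the total accumulated before it is below the target
keep_mask = cumulative.shift(fill_value=0.0) < COVERAGE
selected = feat_imp_toy.loc[keep_mask, "Feature"].tolist()
print(selected)  # ['Calories', 'Protein', 'Fat']
```

With the real data, the same mask could be applied to `feat_imp_df`, and `X[selected]` would then hold the reduced feature set.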

---

**Summary:**  
This notebook covered the preprocessing and EDA workflow for the food nutrition dataset: null and duplicate removal, outlier capping, visualization, standardization, categorical encoding, and feature importance analysis. No new features were engineered; Section 13 marks where they would be added. The processed data is now ready for model building and further analysis.