# Environmental Impact of Food Production Analysis

## Overview
This notebook analyzes the environmental impact of food production, focusing on key metrics such as carbon emissions, water usage, land use, and biodiversity loss.

## Business Questions
1. What are the top food products with the highest environmental impact?
2. How does the environmental impact vary across different farming methods?
3. What is the relationship between water usage and carbon emissions in food production?
4. How does land use efficiency correlate with environmental impact?
5. Which regions have the most sustainable food production practices?
6. What are the trends in environmental impact over time for different food categories?
7. How can we optimize food production to minimize environmental impact?

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from pathlib import Path

# Set style for better visualizations
sns.set_theme()  # This replaces plt.style.use('seaborn')
sns.set_palette("husl")

# Display all columns
pd.set_option('display.max_columns', None)

# Create results directory if it doesn't exist
results_dir = Path("../results")
results_dir.mkdir(parents=True, exist_ok=True)

## Data Loading and Initial Exploration

In [None]:
# Load the dataset
df = pd.read_csv('../data/Food_Production.csv')

# Display basic information about the dataset
print("Dataset Info:")
print(df.info())

# Display first few rows
print("\nFirst few rows:")
display(df.head())

## Data Preprocessing

In [None]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum())

# Basic statistics
print("\nBasic statistics:")
print(df.describe())

# Calculate total missing values percentage
total_missing = df.isnull().sum().sum()
total_cells = df.size
missing_percentage = (total_missing / total_cells) * 100
print(f"\nTotal missing values percentage: {missing_percentage:.2f}%")
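The overall missing percentage above can hide missingness that is concentrated in a few columns. A minimal sketch of a per-column breakdown follows; the `toy` frame and its column names are made up for illustration and stand in for `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; in the notebook this would be `df`.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [1.0, np.nan, np.nan, 4.0],
})

# isnull().mean() gives the fraction of missing cells per column.
missing_by_column = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_by_column)
```

Sorting in descending order puts the columns that need the most attention first.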

## Business Question 1: Top Food Products with Highest Environmental Impact

In [None]:
# Sort by total emissions
top_emissions = df.nlargest(10, 'Total_emissions')

# Create static plot
plt.figure(figsize=(12, 6))
sns.barplot(data=top_emissions, x='Total_emissions', y='Food product')
plt.title('Top 10 Food Products by Total Emissions')
plt.xlabel('Total Emissions (kgCO₂eq)')
plt.tight_layout()
plt.savefig(results_dir / 'top_emissions.png')
plt.show()

# Create interactive plot
fig = px.bar(top_emissions, 
             x='Total_emissions', 
             y='Food product',
             title='Top 10 Food Products by Total Emissions')
fig.show()

# Display the data
print("\nTop 5 Products by Total Emissions:")
display(top_emissions[['Food product', 'Total_emissions']].head())

## Business Question 2: Environmental Impact Across Production Stages

In [None]:
# Select emission columns for production stages
emission_cols = ['Land use change', 'Animal Feed', 'Farm', 'Processing', 
                'Transport', 'Packging', 'Retail']

# Calculate mean emissions by stage
stage_emissions = df[emission_cols].mean()

# Create static pie chart
plt.figure(figsize=(10, 8))
plt.pie(stage_emissions, labels=emission_cols, autopct='%1.1f%%')
plt.title('Distribution of Emissions Across Production Stages')
plt.savefig(results_dir / 'emissions_by_stage.png')
plt.show()

# Create interactive pie chart
fig = px.pie(values=stage_emissions.values, 
             names=stage_emissions.index,
             title='Distribution of Emissions Across Production Stages')
fig.show()

# Display the data
print("\nEmissions Distribution by Stage:")
display(stage_emissions)

# Create bar chart for comparison
plt.figure(figsize=(12, 6))
sns.barplot(x=stage_emissions.values, y=stage_emissions.index)
plt.title('Emissions by Production Stage')
plt.xlabel('Average Emissions (kgCO₂eq)')
plt.tight_layout()
plt.savefig(results_dir / 'emissions_by_stage_bar.png')
plt.show()
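The stage breakdown above is only meaningful if the stage columns actually add up to `Total_emissions`. The following is a hedged sketch of such a consistency check. The helper name `stage_sum_mismatches`, the tolerance `atol`, and the toy rows are assumptions for illustration; the column names (including the dataset's own spelling `Packging`) are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Stage columns as used in the notebook.
STAGE_COLS = ['Land use change', 'Animal Feed', 'Farm', 'Processing',
              'Transport', 'Packging', 'Retail']

def stage_sum_mismatches(frame, total_col='Total_emissions', atol=0.05):
    """Return rows whose stage columns do not add up to the reported total."""
    stage_sum = frame[STAGE_COLS].sum(axis=1)
    mismatch = ~np.isclose(stage_sum, frame[total_col], atol=atol)
    out = frame.loc[mismatch, ['Food product', total_col]]
    return out.assign(stage_sum=stage_sum[mismatch])

# Toy rows with made-up numbers: product A is consistent, product B is not.
toy = pd.DataFrame({
    'Food product': ['A', 'B'],
    'Land use change': [0.1, 1.0],
    'Animal Feed': [0.0, 2.0],
    'Farm': [0.5, 3.0],
    'Processing': [0.1, 0.5],
    'Transport': [0.1, 0.5],
    'Packging': [0.1, 0.5],
    'Retail': [0.1, 0.5],
    'Total_emissions': [1.0, 9.0],
})
mismatches = stage_sum_mismatches(toy)
print(mismatches)
```

In the notebook, the same call would be `stage_sum_mismatches(df)`; an empty result would suggest the totals and the stage columns agree within the chosen tolerance.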


## Business Question 3: Water Usage vs Carbon Emissions


In [None]:
print("Missing values in relevant columns:")
print(df[['Freshwater withdrawals per kilogram (liters per kilogram)', 
          'Total_emissions', 
          'Land use per kilogram (m² per kilogram)']].isnull().sum())

# Create a clean dataset without missing values for our analysis
df_clean = df.dropna(subset=['Freshwater withdrawals per kilogram (liters per kilogram)',
                            'Total_emissions',
                            'Land use per kilogram (m² per kilogram)'])

# Create static scatter plot
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df_clean,
                x='Freshwater withdrawals per kilogram (liters per kilogram)',
                y='Total_emissions',
                size='Land use per kilogram (m² per kilogram)',
                alpha=0.6)
plt.title('Water Usage vs Total Emissions')
plt.xlabel('Freshwater Withdrawals (liters/kg)')
plt.ylabel('Total Emissions (kgCO₂eq/kg)')
plt.savefig('../results/water_usage_vs_emissions.png')
plt.show()

# Create interactive scatter plot
fig = px.scatter(df_clean,
                x='Freshwater withdrawals per kilogram (liters per kilogram)',
                y='Total_emissions',
                size='Land use per kilogram (m² per kilogram)',
                hover_data=['Food product'],
                title='Water Usage vs Total Emissions')
fig.show()

# Calculate correlation
correlation = df_clean['Freshwater withdrawals per kilogram (liters per kilogram)'].corr(df_clean['Total_emissions'])
print(f"\nCorrelation between water usage and emissions: {correlation:.3f}")

# Create correlation heatmap for key environmental metrics
env_metrics = ['Total_emissions',
               'Freshwater withdrawals per kilogram (liters per kilogram)',
               'Land use per kilogram (m² per kilogram)']

plt.figure(figsize=(10, 8))
sns.heatmap(df_clean[env_metrics].corr(), 
            annot=True, 
            cmap='coolwarm', 
            center=0)
plt.title('Correlation Between Environmental Metrics')
plt.savefig('../results/environmental_correlations.png')
plt.show()
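The correlations above are Pearson coefficients, which can be dominated by a few extreme products. As a possible cross-check, a rank-based (Spearman) coefficient can be computed with the same pandas API. This is a sketch on made-up values; the short column names `water` and `emissions` are placeholders for the notebook's full column names.

```python
import pandas as pd

# Made-up values: a monotonic relationship with one extreme point.
toy = pd.DataFrame({
    'water': [1.0, 2.0, 3.0, 4.0, 100.0],
    'emissions': [1.0, 4.0, 9.0, 16.0, 25.0],
})

# Pearson measures linear association; Spearman measures monotonic association.
pearson = toy.corr(method='pearson').loc['water', 'emissions']
spearman = toy.corr(method='spearman').loc['water', 'emissions']
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

In the notebook, `df_clean[env_metrics].corr(method='spearman')` would give the rank-based counterpart of the heatmap. A large gap between the two coefficients hints that outliers or non-linearity are shaping the Pearson value.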

## Business Question 4: Land Use vs Environmental Impact
How does land use efficiency correlate with environmental impact?


In [None]:
print("Missing values in land use related columns:")
print(df[['Land use per kilogram (m² per kilogram)',
          'Land use per 1000kcal (m² per 1000kcal)',
          'Land use per 100g protein (m² per 100g protein)',
          'Total_emissions']].isnull().sum())

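# Create a clean dataset without missing values
# (freshwater withdrawals are included because they drive the marker size below)
df_clean = df.dropna(subset=['Land use per kilogram (m² per kilogram)',
                            'Freshwater withdrawals per kilogram (liters per kilogram)',
                            'Total_emissions'])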

# Create static scatter plot
plt.figure(figsize=(12, 8))
sns.scatterplot(data=df_clean,
                x='Land use per kilogram (m² per kilogram)',
                y='Total_emissions',
                size='Freshwater withdrawals per kilogram (liters per kilogram)',
                alpha=0.6)
plt.title('Land Use vs Total Emissions')
plt.xlabel('Land Use (m²/kg)')
plt.ylabel('Total Emissions (kgCO₂eq/kg)')
plt.savefig('../results/land_use_vs_emissions.png')
plt.show()

# Create interactive scatter plot
fig = px.scatter(df_clean,
                x='Land use per kilogram (m² per kilogram)',
                y='Total_emissions',
                size='Freshwater withdrawals per kilogram (liters per kilogram)',
                hover_data=['Food product'],
                title='Land Use vs Total Emissions')
fig.show()

# Calculate correlation
correlation = df_clean['Land use per kilogram (m² per kilogram)'].corr(df_clean['Total_emissions'])
print(f"\nCorrelation between land use and emissions: {correlation:.3f}")

# Create correlation heatmap for land use and environmental metrics
land_metrics = ['Total_emissions',
                'Land use per kilogram (m² per kilogram)',
                'Freshwater withdrawals per kilogram (liters per kilogram)',
                'Eutrophying emissions per kilogram (gPO₄eq per kilogram)']

plt.figure(figsize=(10, 8))
sns.heatmap(df_clean[land_metrics].corr(), 
            annot=True, 
            cmap='coolwarm', 
            center=0)
plt.title('Correlation Between Land Use and Environmental Metrics')
plt.savefig('../results/land_use_correlations.png')
plt.show()

# Top 5 products by land use efficiency (lowest land use per kg)
print("\nTop 5 products by land use efficiency (lowest land use per kg):")
land_efficiency = df_clean.nsmallest(5, 'Land use per kilogram (m² per kilogram)')[['Food product', 'Land use per kilogram (m² per kilogram)']]
print(land_efficiency)

# Top 5 products by highest land use
print("\nTop 5 products by highest land use per kg:")
land_inefficiency = df_clean.nlargest(5, 'Land use per kilogram (m² per kilogram)')[['Food product', 'Land use per kilogram (m² per kilogram)']]
print(land_inefficiency)

## Business Question 5: Regional Analysis
Which regions have the most sustainable food production practices?

Note: Since the dataset has no explicit region information, this section instead compares sustainability patterns across food categories (animal-based vs plant-based).
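The category split used in this section relies on substring matching against product names, which can misfire when a keyword appears inside a longer word (for example, a name like "Soymilk" contains "milk"). A hedged sketch of a whole-word variant follows. The function name `categorize` and the example product names are illustrative assumptions; the keyword list is the one used in the notebook.

```python
import re
import pandas as pd

# Keyword list as used in the notebook.
ANIMAL_KEYWORDS = ['beef', 'lamb', 'mutton', 'cheese', 'milk',
                   'fish', 'poultry', 'pork', 'eggs']

def categorize(names, keywords=ANIMAL_KEYWORDS):
    """Label names as animal- or plant-based using whole-word keyword matches."""
    # \b anchors each keyword at word boundaries, so 'milk' no longer matches 'Soymilk'.
    pattern = r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b'
    is_animal = names.str.contains(pattern, case=False, regex=True)
    return is_animal.map({True: 'Animal-based', False: 'Plant-based'})

# Illustrative names only; they are not read from the dataset here.
examples = pd.Series(['Beef (beef herd)', 'Soymilk', 'Milk', 'Lamb & Mutton'])
labels = categorize(examples)
print(labels.tolist())
```

Keyword matching also misses animal products whose names contain none of the keywords, so it may be worth printing `df[['Food product', 'Category']]` and reviewing the labels by eye before interpreting the category means.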

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Reload the dataset using the same relative path as the first load
df = pd.read_csv('../data/Food_Production.csv')

# Quick categorization
animal_keywords = ['beef', 'lamb', 'mutton', 'cheese', 'milk', 'fish', 'poultry', 'pork', 'eggs']
df['Category'] = 'Plant-based'
df.loc[df['Food product'].str.lower().str.contains('|'.join(animal_keywords)), 'Category'] = 'Animal-based'

# Calculate mean metrics per category
metric_cols = ['Total_emissions',
               'Land use per kilogram (m² per kilogram)',
               'Freshwater withdrawals per kilogram (liters per kilogram)']
animal_metrics = df.loc[df['Category'] == 'Animal-based', metric_cols].mean()
plant_metrics = df.loc[df['Category'] == 'Plant-based', metric_cols].mean()

# Print results
print("\nAnimal-based metrics:", animal_metrics)
print("\nPlant-based metrics:", plant_metrics)

# Create simple radar chart with matplotlib
metrics = ['Emissions', 'Land Use', 'Water Use']
animal_values = animal_metrics.values
plant_values = plant_metrics.values

# Normalize for comparison
max_values = np.maximum(animal_values, plant_values)
animal_norm = animal_values / max_values
plant_norm = plant_values / max_values

# Set up the radar chart
angles = np.linspace(0, 2*np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]  # Close the loop

animal_norm = animal_norm.tolist()
animal_norm += animal_norm[:1]  # Close the loop
plant_norm = plant_norm.tolist()
plant_norm += plant_norm[:1]  # Close the loop

fig, ax = plt.subplots(figsize=(6, 6), subplot_kw=dict(polar=True))
ax.plot(angles, animal_norm, 'b-', linewidth=2, label='Animal-based')
ax.fill(angles, animal_norm, 'b', alpha=0.1)
ax.plot(angles, plant_norm, 'g-', linewidth=2, label='Plant-based')
ax.fill(angles, plant_norm, 'g', alpha=0.1)
ax.set_thetagrids(np.degrees(angles[:-1]), metrics)
ax.set_title('Environmental Impact Comparison')
ax.legend(loc='upper right')

plt.savefig('../results/category_comparison_radar.png', dpi=100)
print("Chart saved successfully!")

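## Business Question 6: Trends in Environmental Impact
What are the trends in environmental impact over time for different food categories?

Note: The analysis below uses no time column, so it examines distributions and category-level differences rather than trends over time.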


In [None]:
# First, let's check the data quality for our analysis
print("Missing values in key metrics:")
print(df[['Total_emissions',
          'Land use per kilogram (m² per kilogram)',
          'Freshwater withdrawals per kilogram (liters per kilogram)',
          'Eutrophying emissions per kilogram (gPO₄eq per kilogram)']].isnull().sum())

# Create a clean dataset without missing values
# (.copy() avoids SettingWithCopyWarning when the 'Category' column is added later)
df_clean = df.dropna(subset=['Total_emissions',
                            'Land use per kilogram (m² per kilogram)',
                            'Freshwater withdrawals per kilogram (liters per kilogram)',
                            'Eutrophying emissions per kilogram (gPO₄eq per kilogram)']).copy()

# 1. Distribution Analysis
plt.figure(figsize=(15, 10))

# Plot 1: Emissions Distribution
plt.subplot(2, 2, 1)
sns.histplot(data=df_clean, x='Total_emissions', bins=20)
plt.title('Distribution of Total Emissions')
plt.xlabel('Total Emissions (kgCO₂eq/kg)')
plt.ylabel('Count')

# Plot 2: Land Use Distribution
plt.subplot(2, 2, 2)
sns.histplot(data=df_clean, x='Land use per kilogram (m² per kilogram)', bins=20)
plt.title('Distribution of Land Use')
plt.xlabel('Land Use (m²/kg)')
plt.ylabel('Count')

# Plot 3: Water Usage Distribution
plt.subplot(2, 2, 3)
sns.histplot(data=df_clean, x='Freshwater withdrawals per kilogram (liters per kilogram)', bins=20)
plt.title('Distribution of Water Usage')
plt.xlabel('Water Usage (liters/kg)')
plt.ylabel('Count')

# Plot 4: Eutrophying Emissions Distribution
plt.subplot(2, 2, 4)
sns.histplot(data=df_clean, x='Eutrophying emissions per kilogram (gPO₄eq per kilogram)', bins=20)
plt.title('Distribution of Eutrophying Emissions')
plt.xlabel('Eutrophying Emissions (gPO₄eq/kg)')
plt.ylabel('Count')

plt.tight_layout()
plt.savefig('../results/environmental_impact_distributions.png')
plt.show()

# 2. Category-wise Analysis
# Group products into categories
df_clean['Category'] = df_clean['Food product'].apply(lambda x: 
    'Animal-based' if any(animal in x.lower() for animal in ['beef', 'lamb', 'mutton', 'cheese', 'milk', 'fish', 'poultry', 'pork', 'eggs']) 
    else 'Plant-based')

# Create box plots for each metric by category
plt.figure(figsize=(15, 10))

# Plot 1: Emissions by Category
plt.subplot(2, 2, 1)
sns.boxplot(data=df_clean, x='Category', y='Total_emissions')
plt.title('Emissions by Category')
plt.xticks(rotation=45)

# Plot 2: Land Use by Category
plt.subplot(2, 2, 2)
sns.boxplot(data=df_clean, x='Category', y='Land use per kilogram (m² per kilogram)')
plt.title('Land Use by Category')
plt.xticks(rotation=45)

# Plot 3: Water Usage by Category
plt.subplot(2, 2, 3)
sns.boxplot(data=df_clean, x='Category', y='Freshwater withdrawals per kilogram (liters per kilogram)')
plt.title('Water Usage by Category')
plt.xticks(rotation=45)

# Plot 4: Eutrophying Emissions by Category
plt.subplot(2, 2, 4)
sns.boxplot(data=df_clean, x='Category', y='Eutrophying emissions per kilogram (gPO₄eq per kilogram)')
plt.title('Eutrophying Emissions by Category')
plt.xticks(rotation=45)

plt.tight_layout()
plt.savefig('../results/category_impact_trends.png')
plt.show()

# 3. Correlation Analysis
# Create correlation matrix for key metrics
metrics = ['Total_emissions',
          'Land use per kilogram (m² per kilogram)',
          'Freshwater withdrawals per kilogram (liters per kilogram)',
          'Eutrophying emissions per kilogram (gPO₄eq per kilogram)']

plt.figure(figsize=(10, 8))
sns.heatmap(df_clean[metrics].corr(), 
            annot=True, 
            cmap='coolwarm', 
            center=0)
plt.title('Correlation Between Environmental Metrics')
plt.savefig('../results/environmental_metrics_correlations.png')
plt.show()

# 4. Summary Statistics
print("\nSummary Statistics by Category:")
print("\nAnimal-based Products:")
print(df_clean[df_clean['Category'] == 'Animal-based'][metrics].describe())
print("\nPlant-based Products:")
print(df_clean[df_clean['Category'] == 'Plant-based'][metrics].describe())

# 5. Top and Bottom Performers
print("\nTop 5 Products by Environmental Impact (Total Emissions):")
print(df_clean.nlargest(5, 'Total_emissions')[['Food product', 'Category', 'Total_emissions']])

print("\nTop 5 Most Environmentally Efficient Products (Lowest Total Emissions):")
print(df_clean.nsmallest(5, 'Total_emissions')[['Food product', 'Category', 'Total_emissions']])

## Business Question 7: Optimization Recommendations
How can we optimize food production to minimize environmental impact?


In [None]:
# First, let's check the data quality for our analysis
print("Missing values in key metrics:")
print(df[['Total_emissions',
          'Land use per kilogram (m² per kilogram)',
          'Freshwater withdrawals per kilogram (liters per kilogram)',
          'Eutrophying emissions per kilogram (gPO₄eq per kilogram)']].isnull().sum())

# Create a clean dataset without missing values
# (.copy() avoids SettingWithCopyWarning when the normalized columns are added later)
df_clean = df.dropna(subset=['Total_emissions',
                            'Land use per kilogram (m² per kilogram)',
                            'Freshwater withdrawals per kilogram (liters per kilogram)',
                            'Eutrophying emissions per kilogram (gPO₄eq per kilogram)']).copy()

# 1. Identify High-Impact Products
# Calculate environmental impact score (normalized combination of key metrics)
metrics = ['Total_emissions',
          'Land use per kilogram (m² per kilogram)',
          'Freshwater withdrawals per kilogram (liters per kilogram)',
          'Eutrophying emissions per kilogram (gPO₄eq per kilogram)']

# Normalize each metric
for metric in metrics:
    df_clean[f'{metric}_normalized'] = (df_clean[metric] - df_clean[metric].min()) / (df_clean[metric].max() - df_clean[metric].min())

# Calculate overall impact score
df_clean['Environmental_Impact_Score'] = df_clean[[f'{metric}_normalized' for metric in metrics]].mean(axis=1)

# Identify top 5 products with highest environmental impact
print("\nTop 5 Products with Highest Environmental Impact:")
high_impact_products = df_clean.nlargest(5, 'Environmental_Impact_Score')[['Food product', 'Environmental_Impact_Score'] + metrics]
print(high_impact_products)

# 2. Optimization Potential Analysis
# Calculate potential improvements (assuming 20% reduction is achievable)
optimization_potential = high_impact_products.copy()
for metric in metrics:
    optimization_potential[f'{metric}_Potential_Reduction'] = optimization_potential[metric] * 0.2

# Create visualization of optimization potential
plt.figure(figsize=(15, 8))
x = np.arange(len(optimization_potential['Food product']))
width = 0.35

plt.bar(x - width/2, optimization_potential['Total_emissions'], width, label='Current Emissions')
plt.bar(x + width/2, optimization_potential['Total_emissions'] - optimization_potential['Total_emissions_Potential_Reduction'], 
        width, label='Potential Emissions (20% reduction)')

plt.xlabel('Food Product')
plt.ylabel('Emissions (kgCO₂eq/kg)')
plt.title('Current vs Potential Emissions for High-Impact Products')
plt.xticks(x, optimization_potential['Food product'], rotation=45)
plt.legend()
plt.tight_layout()
plt.savefig('../results/optimization_potential.png')
plt.show()

# 3. Production Stage Analysis for Optimization
# Calculate contribution of each stage to total emissions
stage_columns = ['Land use change', 'Animal Feed', 'Farm', 'Processing', 'Transport', 'Packging', 'Retail']
stage_contributions = df_clean[stage_columns].mean()

# Create pie chart of stage contributions
plt.figure(figsize=(10, 8))
plt.pie(stage_contributions, labels=stage_contributions.index, autopct='%1.1f%%')
plt.title('Contribution of Each Stage to Total Emissions')
plt.savefig('../results/production_stage_contributions.png')
plt.show()

# 4. Optimization Recommendations
print("\nOptimization Recommendations by Production Stage:")

# Farm Stage (highest impact)
print("\n1. Farm Stage Optimization:")
print("- Implement precision agriculture techniques")
print("- Optimize fertilizer and pesticide use")
print("- Improve soil management practices")
print("- Adopt sustainable farming methods")

# Land Use Stage
print("\n2. Land Use Optimization:")
print("- Implement crop rotation")
print("- Optimize land utilization")
print("- Restore degraded land")
print("- Protect natural habitats")

# Water Usage
print("\n3. Water Usage Optimization:")
print("- Implement efficient irrigation systems")
print("- Optimize water recycling")
print("- Reduce water waste")
print("- Improve water management practices")

# Processing and Transport
print("\n4. Processing and Transport Optimization:")
print("- Optimize supply chain efficiency")
print("- Reduce food waste")
print("- Improve packaging efficiency")
print("- Implement energy-efficient processing")

# 5. Category-Specific Recommendations
print("\nCategory-Specific Recommendations:")

# Animal-based products
print("\nFor Animal-based Products:")
print("- Improve feed efficiency")
print("- Optimize breeding practices")
print("- Implement better waste management")
print("- Reduce methane emissions")

# Plant-based products
print("\nFor Plant-based Products:")
print("- Optimize crop yields")
print("- Improve pest management")
print("- Enhance soil health")
print("- Reduce post-harvest losses")

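# 6. Cost-Benefit Analysis
# Calculate potential environmental impact reduction
# Note: the metrics have different units (kgCO₂eq, m², liters, gPO₄eq), so this
# sum is only a rough aggregate index, not a physical quantity
total_potential_reduction = optimization_potential[metrics].sum().sum() * 0.2
print(f"\nTotal Potential Environmental Impact Reduction: {total_potential_reduction:.2f} units")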

# Create summary visualization
plt.figure(figsize=(12, 6))
metrics_reduction = optimization_potential[[f'{metric}_Potential_Reduction' for metric in metrics]].sum()
plt.bar(metrics, metrics_reduction)
plt.title('Potential Reduction in Environmental Impact by Metric')
plt.xticks(rotation=45)
plt.ylabel('Potential Reduction')
plt.tight_layout()
plt.savefig('../results/potential_reduction_summary.png')
plt.show()
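The `Environmental_Impact_Score` used above is the mean of min-max normalized metrics, so each metric is rescaled to the range 0 to 1 before averaging. A minimal worked sketch on made-up numbers follows; the helper name `impact_score` and the toy column names are assumptions for illustration.

```python
import pandas as pd

def impact_score(frame, metric_cols):
    """Mean of min-max normalized metrics, one score per row."""
    values = frame[metric_cols]
    # Rescale each column so its minimum maps to 0 and its maximum to 1.
    normalized = (values - values.min()) / (values.max() - values.min())
    return normalized.mean(axis=1)

# Made-up values for three hypothetical products.
toy = pd.DataFrame({
    'emissions': [1.0, 5.0, 9.0],   # normalizes to 0.0, 0.5, 1.0
    'land': [2.0, 4.0, 10.0],       # normalizes to 0.0, 0.25, 1.0
})
scores = impact_score(toy, ['emissions', 'land'])
print(scores.round(3).tolist())
```

Because every metric gets equal weight and the scale depends on the minimum and maximum in the data, a single extreme product can compress the scores of all others. A column with identical values would also yield NaN, since its range is zero.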


In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats  # Needed for the t-tests in validate_insights
import warnings
warnings.filterwarnings('ignore')  # Suppress warnings

def validate_insights(df):
    print("\n=== Insight Validation ===")
    
    # 1. Data Quality Check
    print("\n1. Data Quality Validation:")
    print("Missing Values:")
    print(df.isnull().sum())
    print("\nData Types:")
    print(df.dtypes)
    
    # 2. Statistical Significance Tests
    print("\n2. Statistical Significance Tests:")
    
    # Categorize data
    df['Category'] = np.where(df['Food product'].str.contains('beef|lamb|mutton|cheese|milk|fish|poultry|pork|eggs', case=False), 
                            'Animal-based', 'Plant-based')
    
    # Compare categories
    animal_data = df[df['Category'] == 'Animal-based']
    plant_data = df[df['Category'] == 'Plant-based']
    
    print("\nCategory Comparison:")
    for metric in ['Total_emissions', 'Land use per kilogram (m² per kilogram)', 
                  'Freshwater withdrawals per kilogram (liters per kilogram)']:
        # Omit NaNs so metrics with missing values still yield a valid test result
        t_stat, p_value = stats.ttest_ind(animal_data[metric], plant_data[metric], nan_policy='omit')
        print(f"\n{metric}:")
        print(f"Animal-based mean: {animal_data[metric].mean():.2f}")
        print(f"Plant-based mean: {plant_data[metric].mean():.2f}")
        print(f"T-statistic: {t_stat:.4f}")
        print(f"P-value: {p_value:.4f}")
        print(f"Significant difference: {'Yes' if p_value < 0.05 else 'No'}")
    
    # 3. Outlier Detection
    print("\n3. Outlier Detection:")
    for metric in ['Total_emissions', 'Land use per kilogram (m² per kilogram)', 
                  'Freshwater withdrawals per kilogram (liters per kilogram)']:
        Q1 = df[metric].quantile(0.25)
        Q3 = df[metric].quantile(0.75)
        IQR = Q3 - Q1
        outliers = df[((df[metric] < (Q1 - 1.5 * IQR)) | (df[metric] > (Q3 + 1.5 * IQR)))]
        print(f"\n{metric} outliers:")
        print(f"Number of outliers: {len(outliers)}")
        if len(outliers) > 0:
            print("Outlier products:")
            print(outliers[['Food product', metric]])
    
    # 4. Correlation Analysis
    print("\n4. Correlation Analysis:")
    correlation_matrix = df[['Total_emissions', 'Land use per kilogram (m² per kilogram)', 
                           'Freshwater withdrawals per kilogram (liters per kilogram)']].corr()
    print(correlation_matrix)
    
    # 5. Distribution Analysis
    print("\n5. Distribution Analysis:")
    for metric in ['Total_emissions', 'Land use per kilogram (m² per kilogram)', 
                  'Freshwater withdrawals per kilogram (liters per kilogram)']:
        print(f"\n{metric} distribution:")
        print(df[metric].describe())
    
    # 6. Visual Validation
    plt.figure(figsize=(15, 10))
    
    # Box plots for each metric by category
    for i, metric in enumerate(['Total_emissions', 'Land use per kilogram (m² per kilogram)', 
                              'Freshwater withdrawals per kilogram (liters per kilogram)'], 1):
        plt.subplot(2, 2, i)
        sns.boxplot(data=df, x='Category', y=metric)
        plt.title(f'{metric} by Category')
    
    # Correlation heatmap
    plt.subplot(2, 2, 4)
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
    plt.title('Correlation Heatmap')
    
    plt.tight_layout()
    plt.savefig('../results/insight_validation.png')
    plt.close()

# Load the data using the same relative path as the first load
df = pd.read_csv('../data/Food_Production.csv')

# Run validation
validate_insights(df)


=== Insight Validation ===

1. Data Quality Validation:
Missing Values:
Food product                                                                0
Land use change                                                             0
Animal Feed                                                                 0
Farm                                                                        0
Processing                                                                  0
Transport                                                                   0
Packging                                                                    0
Retail                                                                      0
Total_emissions                                                             0
Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal)                   10
Eutrophying emissions per kilogram (gPO₄eq per kilogram)                    5
Eutrophying emissions per 100g protein (gPO₄eq per 100 grams protein)
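The t-tests in `validate_insights` assume roughly normal groups, which may not hold for skewed metrics and a small animal-based group. One possible cross-check is a rank-based test. The sketch below uses `scipy.stats.mannwhitneyu`; the helper name `rank_test` and the toy values are assumptions for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

def rank_test(group_a, group_b):
    """Two-sided Mann-Whitney U test after dropping NaNs from each group."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    a = a[~np.isnan(a)]
    b = b[~np.isnan(b)]
    result = stats.mannwhitneyu(a, b, alternative='two-sided')
    return result.statistic, result.pvalue

# Made-up values: every non-missing value in the first group exceeds the second group.
toy_animal = [20.0, 25.0, 60.0, 22.0, np.nan]
toy_plant = [1.0, 2.0, 3.0, 4.0, 5.0]
u_stat, p_value = rank_test(toy_animal, toy_plant)
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```

In the notebook, the call would be `rank_test(animal_data[metric], plant_data[metric])` inside the same loop. If the t-test and the rank test disagree, the result is probably sensitive to outliers.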

## Actionable Recommendations
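The recommendations cell expects an `insights` dictionary whose construction is not shown in this notebook. Judging from how it is used, it needs a `top_emitters` table and a `most_efficient` table. The following is a hedged sketch of one way to build it; the helper name `build_insights`, the default `n=5`, and the toy rows are assumptions, not the author's original code.

```python
import numpy as np
import pandas as pd

LAND_COL = 'Land use per kilogram (m² per kilogram)'

def build_insights(frame, n=5):
    """Assemble the tables that the recommendations function reads."""
    return {
        # Products with the highest total emissions.
        'top_emitters': frame.nlargest(n, 'Total_emissions')[
            ['Food product', 'Total_emissions']],
        # Products with the lowest land use per kg, ignoring missing values.
        'most_efficient': frame.dropna(subset=[LAND_COL]).nsmallest(n, LAND_COL)[
            ['Food product', LAND_COL]],
    }

# Toy rows with made-up numbers for four hypothetical products.
toy = pd.DataFrame({
    'Food product': ['P1', 'P2', 'P3', 'P4'],
    'Total_emissions': [3.0, 50.0, 1.0, 20.0],
    LAND_COL: [5.0, 300.0, 0.5, np.nan],
})
toy_insights = build_insights(toy, n=2)
print(toy_insights['top_emitters'])
print(toy_insights['most_efficient'])
```

In the notebook, `insights = build_insights(df)` would need to run before the cell below. That cell also reads `df['Category']`, which is added when `validate_insights(df)` runs.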

In [5]:
def generate_detailed_recommendations(df, insights):
    print("\n=== Detailed Actionable Recommendations ===")
    
    # Calculate specific metrics for recommendations
    avg_animal_emissions = df[df['Category'] == 'Animal-based']['Total_emissions'].mean()
    avg_plant_emissions = df[df['Category'] == 'Plant-based']['Total_emissions'].mean()
    emissions_reduction_potential = avg_animal_emissions - avg_plant_emissions
    
    # 1. For Policymakers
    print("\n1. Policy Recommendations:")
    print(f"- Implement carbon pricing for food production, with higher rates for products emitting > {avg_animal_emissions:.2f} kg CO2eq/kg")
    print("- Create subsidies for sustainable farming practices, particularly for products with emissions below the industry average")
    print("- Develop water usage regulations, especially for products with high water consumption")
    print("- Support research in sustainable farming technologies")
    
    # 2. For Food Producers
    print("\n2. Industry Recommendations:")
    print("- Shift production towards plant-based alternatives, which show significantly lower emissions compared to animal-based products")
    print("- Focus on improving efficiency in these high-impact areas:")
    for _, row in insights['top_emitters'].iterrows():
        print(f"  * {row['Food product']}: {row['Total_emissions']:.2f} kg CO2eq/kg")
    print("- Invest in water-efficient farming technologies")
    print("- Consider vertical farming for high-impact products")
    
    # 3. For Consumers
    print("\n3. Consumer Recommendations:")
    print("- Choose these sustainable alternatives:")
    for _, row in insights['most_efficient'].iterrows():
        print(f"  * {row['Food product']}: {row['Land use per kilogram (m² per kilogram)']:.2f} m²/kg")
    print("- Reduce consumption of high-impact products:")
    for _, row in insights['top_emitters'].iterrows():
        print(f"  * {row['Food product']}: {row['Total_emissions']:.2f} kg CO2eq/kg")
    
    # 4. Environmental Impact Projections
    print("\n4. Potential Environmental Impact:")
    print("- Significant reduction in emissions through adoption of sustainable farming practices")
    print("- Focus on water-intensive products for immediate impact")
    
    # 5. Implementation Timeline
    print("\n5. Implementation Timeline:")
    print("Short-term (0-6 months):")
    print("- Begin consumer education campaigns")
    print("- Start water usage monitoring")
    print("Medium-term (6-12 months):")
    print("- Implement efficiency improvements")
    print("- Develop sustainable farming practices")
    print("Long-term (1-2 years):")
    print("- Scale up sustainable production")
    print("- Implement comprehensive monitoring systems")

# Generate detailed recommendations
generate_detailed_recommendations(df, insights)


=== Detailed Actionable Recommendations ===

1. Policy Recommendations:
- Implement carbon pricing for food production, with higher rates for products emitting > 16.21 kg CO2eq/kg
- Create subsidies for sustainable farming practices, particularly for products with emissions below the industry average
- Develop water usage regulations, especially for products with high water consumption
- Support research in sustainable farming technologies

2. Industry Recommendations:
- Shift production towards plant-based alternatives, which show significantly lower emissions compared to animal-based products
- Focus on improving efficiency in these high-impact areas:
  * Beef (beef herd): 59.60 kg CO2eq/kg
  * Lamb & Mutton: 24.50 kg CO2eq/kg
  * Cheese: 21.20 kg CO2eq/kg
  * Beef (dairy herd): 21.10 kg CO2eq/kg
  * Dark Chocolate: 18.70 kg CO2eq/kg
- Invest in water-efficient farming technologies
- Consider vertical farming for high-impact products

3. Consumer Recommendations:
- Choose these sust