# PHASE 1: BUSINESS UNDERSTANDING

## Problem statement

Identify which foods and lifecycle stages (farm, processing, transport, etc.) drive the largest environmental burdens, and recommend the highest-leverage actions (dietary swaps, sourcing policies, logistics/packaging changes) to lower total impact without undermining nutrition or cost.


## Stakeholders & decisions

- Policy & NGOs: dietary guidance, incentives for lower-impact foods, water-scarcity risk management. 

- Procurement & Retail: product mix, supplier selection, transport/packaging optimization.

- Producers/Farmers: practice changes (feed, fertilizer, irrigation efficiency).

- Consumers: informed swaps toward lower-impact alternatives.

## Success metrics (KPIs)

- GHG intensity (kg CO₂-eq per kg product; and optionally per 1000 kcal / per 100 g protein for fair comparisons across food types).

- Water footprint (freshwater withdrawals; scarcity-weighted water use).

- Land use & land-use change contributions.

- Stage contributions (% share from farm/feed/processing/transport/packaging/retail).


## Business questions
These are the key business questions to be answered by the end of the project:

1. Which foods are highest/lowest impact by GHG per kg? By kcal? By 100 g protein? 

2. Which lifecycle stages dominate impacts for each food (e.g., farm vs transport vs packaging)? 


3. Top leverage points: which 8–10 foods account for ~80% of total GHG (Pareto) and what stage drives each?

4. Water risk: which foods have extreme scarcity-weighted water use, and where do withdrawals cluster?

5. Dietary swaps: what realistic substitutions (e.g., beef → poultry/legumes; dairy → plant milks) yield the largest impact reduction per serving of protein/kcal?

6. Transport & packaging sensitivity: for which foods do these stages contribute a non-trivial share of total GHG (i.e., more than 10–15%)?

7. Consistency trade-offs: do lower-GHG foods sometimes have higher water or land footprints? What’s the recommended balance?

8. Scenario impact: if a retailer shifts X% of sales from high- to medium-impact foods, what is the projected GHG/water reduction?
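
Question 8 reduces to simple arithmetic once sales volumes are known. The sketch below is a hedged illustration, not the project's model: it assumes emissions scale linearly with kilograms sold, a 1:1 swap by mass, and that `Total_emissions` is kg CO₂-eq per kg of product. The two intensities (Rice 4.0, Wheat & Rye (Bread) 1.4) are the `Total_emissions` values shown in the Phase 2 data preview and serve only as stand-ins; the sales volume, the shift fraction, and the helper name `projected_saving` are hypothetical.

```python
def projected_saving(volume_kg, shift_fraction, intensity_from, intensity_to):
    """GHG saved (kg CO2-eq) when a fraction of sales volume moves between two foods.

    Assumes a 1:1 swap by mass and intensities in kg CO2-eq per kg product.
    """
    if not 0.0 <= shift_fraction <= 1.0:
        raise ValueError("shift_fraction must be between 0 and 1")
    shifted_kg = volume_kg * shift_fraction
    return shifted_kg * (intensity_from - intensity_to)

# Hypothetical retailer: 1,000 kg of the higher-impact food sold, 10% shifted.
saving = projected_saving(
    volume_kg=1000,
    shift_fraction=0.10,
    intensity_from=4.0,  # Rice, Total_emissions from the preview
    intensity_to=1.4,    # Wheat & Rye (Bread), Total_emissions from the preview
)
print(f"Projected saving: {saving:.1f} kg CO2-eq")
```

A swap by mass is the simplest case; a swap per 1000 kcal or per 100 g protein would use the matching intensity columns instead.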

# PHASE 2: DATA UNDERSTANDING
The objective of this phase is to load the dataset, understand its structure, and perform initial exploration.

A. Loading and Inspecting the data

In [14]:

from pathlib import Path
import re
import numpy as np
import pandas as pd
from IPython.display import display  # explicit import so the cell also runs outside Jupyter

pd.set_option("display.width", 140)
pd.set_option("display.max_columns", 60)

DATA_PATH = Path("Food_Production.csv")  # adjust if needed
df = pd.read_csv(DATA_PATH)

# 1) Identify label (food name) column: use a known name or first object column
label_candidates_exact = {"food", "food product", "product", "item", "entity", "name"}
label_candidates = [c for c in df.columns if c.strip().lower() in label_candidates_exact]
if label_candidates:
    LABEL = label_candidates[0]
else:
    obj_cols = df.select_dtypes(include="object").columns.tolist()
    LABEL = obj_cols[0] if obj_cols else df.columns[0]

# 2) Snapshot 
rows, cols = df.shape
print(f"Snapshot: {rows} rows x {cols} columns; one row ≈ one food item")
print(f"Label column: {LABEL!r}")
display(df.head(5))


Snapshot: 43 rows x 23 columns; one row ≈ one food item
Label column: 'Food product'


Unnamed: 0,Food product,Land use change,Animal Feed,Farm,Processing,Transport,Packging,Retail,Total_emissions,Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal),Eutrophying emissions per kilogram (gPO₄eq per kilogram),Eutrophying emissions per 100g protein (gPO₄eq per 100 grams protein),Freshwater withdrawals per 1000kcal (liters per 1000kcal),Freshwater withdrawals per 100g protein (liters per 100g protein),Freshwater withdrawals per kilogram (liters per kilogram),Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal),Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein),Land use per 1000kcal (m² per 1000kcal),Land use per kilogram (m² per kilogram),Land use per 100g protein (m² per 100g protein),Scarcity-weighted water use per kilogram (liters per kilogram),Scarcity-weighted water use per 100g protein (liters per 100g protein),Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories)
0,Wheat & Rye (Bread),0.1,0.0,0.8,0.2,0.1,0.1,0.1,1.4,,,,,,,,,,,,,,
1,Maize (Meal),0.3,0.0,0.5,0.1,0.1,0.1,0.0,1.1,,,,,,,,,,,,,,
2,Barley (Beer),0.0,0.0,0.2,0.1,0.0,0.5,0.3,1.1,,,,,,,,,,,,,,
3,Oatmeal,0.0,0.0,1.4,0.0,0.1,0.1,0.0,1.6,4.281357,11.23,8.638462,183.911552,371.076923,482.4,0.945482,1.907692,2.897446,7.6,5.846154,18786.2,14450.92308,7162.104461
4,Rice,0.0,0.0,3.6,0.1,0.1,0.1,0.1,4.0,9.514379,35.07,49.394366,609.983722,3166.760563,2248.4,1.207271,6.267606,0.759631,2.8,3.943662,49576.3,69825.77465,13449.89148


In [15]:

# Key Variables & Bases

def find_cols(pattern, cols):
    return [c for c in cols if re.search(pattern, c, flags=re.I)]

cols = df.columns.tolist()

# Presence only (no heavy calcs)
ghg_cols_perkg   = find_cols(r"(ghg|emission|co2|co₂).*(/|\s)kg", cols)
ghg_cols_perkcal = find_cols(r"(ghg|emission|co2|co₂).*1000\s*kcal", cols)
ghg_cols_perprot = find_cols(r"(ghg|emission|co2|co₂).*100\s*g.*protein", cols)

freshwater_cols  = find_cols(r"freshwater\s*withdrawals", cols)
scarcity_cols    = find_cols(r"scarcity[-\s]*weighted\s*water", cols)

def report(found_label, missing_label, found):
    # Show up to three matches, or note that nothing matched
    if found:
        print(f"{found_label}:", found[:3], "…")
    else:
        print(f"{missing_label}: not found")

report("GHG columns (per kg)", "GHG per kg", ghg_cols_perkg)
report("GHG columns (per 1000 kcal)", "GHG per 1000 kcal", ghg_cols_perkcal)
report("GHG columns (per 100 g protein)", "GHG per 100 g protein", ghg_cols_perprot)

report("Freshwater withdrawals columns", "Freshwater withdrawals", freshwater_cols)
report("Scarcity-weighted water columns", "Scarcity-weighted water", scarcity_cols)


GHG per kg: not found
GHG columns (per 1000 kcal): ['Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal)', 'Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal)'] …
GHG columns (per 100 g protein): ['Eutrophying emissions per 100g protein (gPO₄eq per 100 grams protein)', 'Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein)'] …
Freshwater withdrawals columns: ['Freshwater withdrawals per 1000kcal (liters per 1000kcal)', 'Freshwater withdrawals per 100g protein (liters per 100g protein)', 'Freshwater withdrawals per kilogram (liters per kilogram)'] …
Scarcity-weighted water columns: ['Scarcity-weighted water use per kilogram (liters per kilogram)', 'Scarcity-weighted water use per 100g protein (liters per 100g protein)', 'Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories)'] …
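
The output above shows two side effects of the bare `emission` alternative in the GHG patterns: the eutrophying-emissions columns are reported as GHG columns, and no per-kg GHG column is found. A minimal sketch of a tighter search follows. It assumes the column names printed above, and it assumes that `Total_emissions` is the per-kg GHG total (the file does not label its unit); `GHG_TOKEN` is an illustrative name.

```python
import re

# Column names copied from the output above (subset relevant to GHG detection).
cols = [
    "Total_emissions",
    "Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal)",
    "Eutrophying emissions per 100g protein (gPO₄eq per 100 grams protein)",
    "Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal)",
    "Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein)",
]

def find_cols(pattern, cols):
    return [c for c in cols if re.search(pattern, c, flags=re.I)]

# Require an explicit greenhouse-gas token instead of the generic word "emission".
GHG_TOKEN = r"(greenhouse\s*gas|ghg|co2|co₂)"

ghg_cols_perkcal = find_cols(GHG_TOKEN + r".*1000\s*kcal", cols)
ghg_cols_perprot = find_cols(GHG_TOKEN + r".*100\s*g.*protein", cols)
# Assumption: the per-kg GHG total is the unlabelled Total_emissions column.
ghg_cols_perkg = find_cols(r"^total[_\s]*emissions$", cols)

print(ghg_cols_perkcal)
print(ghg_cols_perprot)
print(ghg_cols_perkg)
```

With these patterns each basis resolves to a single column, so taking the first match is no longer order-dependent.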


In [16]:
# =========================
# Basic Quality Checks
# =========================
# Duplicates
dups = df.duplicated().sum()
print(f"Duplicates: {dups}")

# Missingness (show only columns with any missing)
miss = df.isna().sum()
miss_nonzero = miss[miss > 0].sort_values(ascending=False)
if len(miss_nonzero):
    print("Columns with missing values:")
    display(miss_nonzero.to_frame("missing_count"))
else:
    print("Missingness: none detected")

# Negatives / Zeros (quick counts across numeric cols)
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
neg_counts = df[num_cols].lt(0).sum()
zero_counts = df[num_cols].eq(0).sum()

neg_total = int(neg_counts.sum())
zero_total = int(zero_counts.sum())

print(f"Negative values (any numeric col): {neg_total} cells total")
print(f"Zero values (any numeric col): {zero_total} cells total")


Duplicates: 0
Columns with missing values:


column,missing_count
Freshwater withdrawals per 100g protein (liters per 100g protein),17
Scarcity-weighted water use per 100g protein (liters per 100g protein),17
Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein),16
Land use per 100g protein (m² per 100g protein),16
Eutrophying emissions per 100g protein (gPO₄eq per 100 grams protein),16
Freshwater withdrawals per 1000kcal (liters per 1000kcal),13
Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories),13
Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal),10
Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal),10
Land use per 1000kcal (m² per 1000kcal),10


Negative values (any numeric col): 4 cells total
Zero values (any numeric col): 107 cells total
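
Raw counts are easier to judge as a share of the 43 rows. The small sketch below uses the counts printed above; on the real frame, `df.isna().mean()` gives the same fractions directly. The short labels are abbreviations introduced here for readability, not dataset column names.

```python
import pandas as pd

N_ROWS = 43  # from the snapshot above

# Missing counts copied from the output above, with shortened labels.
missing_count = pd.Series({
    "freshwater / 100g protein": 17,
    "scarcity water / 100g protein": 17,
    "GHG / 100g protein": 16,
    "land use / 100g protein": 16,
    "eutrophying / 100g protein": 16,
    "freshwater / 1000kcal": 13,
    "scarcity water / 1000kcal": 13,
    "GHG / 1000kcal": 10,
    "eutrophying / 1000kcal": 10,
    "land use / 1000kcal": 10,
})

# Percentage of rows missing per column, rounded to one decimal place.
missing_pct = (missing_count / N_ROWS * 100).round(1)
print(missing_pct.to_frame("missing_pct"))
```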


In [17]:
# =========================
# Distribution Notes (Light)
# =========================
# We only compute a simple range to flag skew; no plots here.
def quick_range(cols_list, label):
    cols_list = [c for c in cols_list if c in df.columns]
    if not cols_list:
        print(f"{label}: none found")
        return
    rng = pd.DataFrame({
        "min": df[cols_list].min(),
        "max": df[cols_list].max(),
        "p95": df[cols_list].quantile(0.95),
    }).sort_values("max", ascending=False)
    print(f"{label} — top ranges (by max):")
    display(rng.head(5))

quick_range(scarcity_cols, "Scarcity-weighted water")
quick_range(freshwater_cols, "Freshwater withdrawals")


Scarcity-weighted water — top ranges (by max):


column,min,max,p95
Scarcity-weighted water use per 100g protein (liters per 100g protein),421.25,431620.0,193910.146825
Scarcity-weighted water use per kilogram (liters per kilogram),0.0,229889.8,177985.76
Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories),4.095023,49735.88235,45849.363675


Freshwater withdrawals — top ranges (by max):


column,min,max,p95
Freshwater withdrawals per 100g protein (liters per 100g protein),32.375,6003.333333,3987.454545
Freshwater withdrawals per kilogram (liters per kilogram),0.0,5605.2,3757.675
Freshwater withdrawals per 1000kcal (liters per 1000kcal),0.723982,2062.178771,1722.241126


## Data Understanding (Brief)

**Dataset**
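- 43 rows × 23 columns; one row ≈ one food item.
- Metrics include: GHG emissions (kg CO₂-eq), eutrophying emissions (g PO₄-eq), land use (m²), and water use (freshwater withdrawals; scarcity-weighted water), plus per-stage emission columns (land use change, animal feed, farm, processing, transport, packaging, retail) and `Total_emissions`.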

**Units & Bases**
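- Water, land use, and eutrophication are reported per **kg**, per **1000 kcal**, and per **100 g protein**.
- GHG has labelled columns only per **1000 kcal** and per **100 g protein**; the column search found no per-kg GHG column, so the per-kg basis has to come from `Total_emissions` and the stage columns (assumed to be kg CO₂-eq per kg product, to be confirmed during prep).
- We will present results on all three bases to avoid biased rankings.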
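
The bases are tied together by each food's energy and protein density, which is why rankings can change between them. As a hedged illustration, the implied energy density can be recovered from any metric reported on both bases: kcal per kg = 1000 × (value per kg) / (value per 1000 kcal). The sketch below uses the Oatmeal and Rice freshwater values from the preview above; the short column names are placeholders, not the dataset's.

```python
import pandas as pd

# Freshwater withdrawals copied from the preview (rows 3 and 4).
foods = pd.DataFrame({
    "food": ["Oatmeal", "Rice"],
    "water_l_per_kg": [482.4, 2248.4],
    "water_l_per_1000kcal": [183.911552, 609.983722],
})

# Implied energy density: how many kcal one kg of the food provides.
foods["kcal_per_kg"] = 1000 * foods["water_l_per_kg"] / foods["water_l_per_1000kcal"]
print(foods[["food", "kcal_per_kg"]].round(0))
```

Rice comes out more energy-dense than oatmeal (roughly 3,700 vs 2,600 kcal per kg), so its per-kcal footprint shrinks more relative to its per-kg footprint.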

**Quality Checks**
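- Duplicates: none detected.
- Missingness: substantial in the normalized columns. Ten columns have gaps: 10–13 of 43 rows (23–30%) in the per-1000-kcal columns and 16–17 of 43 rows (37–40%) in the per-100-g-protein columns. Will coerce numerics (`to_numeric`) during prep.
- Negatives: 4 cells; these are logged and set to NaN during prep. Zeros: 107 cells, including stage metrics where a zero contribution is expected.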

**Distribution Notes**
- Water metrics (especially scarcity-weighted) are highly right-skewed with large outliers.
- Visuals will use log or winsorized views; raw tables remain untrimmed for transparency.
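
A minimal sketch of the two views mentioned above, assuming a `log1p` transform (so zeros stay defined) and one-sided winsorizing at the 95th percentile; both choices are assumptions, not settled decisions. The values are the scarcity-weighted water use per kilogram figures for rows 3 to 9 of the previews in this notebook.

```python
import numpy as np
import pandas as pd

scarcity_per_kg = pd.Series(
    [18786.2, 49576.3, 2754.2, 0.0, 16438.6, 9493.3, 22477.4],
    index=["Oatmeal", "Rice", "Potatoes", "Cassava",
           "Cane Sugar", "Beet Sugar", "Other Pulses"],
)

# Log view: log1p(x) = log(1 + x), so the zero for Cassava maps to 0 instead of -inf.
log_view = np.log1p(scarcity_per_kg)

# Winsorized view: cap values above the 95th percentile; the raw series is untouched.
cap = scarcity_per_kg.quantile(0.95)
winsorized = scarcity_per_kg.clip(upper=cap)

print(pd.DataFrame({"raw": scarcity_per_kg, "log1p": log_view, "winsorized": winsorized}))
```

Only the plotted copy is transformed, which keeps the raw tables untrimmed as stated above.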

**Implications**
- Basis choice (kg vs kcal vs protein) can flip rankings → keep multi-basis reporting.
- Proceed to Data Preparation: standardize column names/units, numeric coercion, and build derived fields (e.g., total GHG per kg).


# PHASE 3: DATA CLEANING & PREPARATION

In [18]:

import re
from pathlib import Path

import numpy as np
import pandas as pd

DATA = Path("Food_Production.csv")

def normalize_name(s):
    s = str(s).strip()
    s = re.sub(r"\s+", "_", s)
    s = re.sub(r"[^\w\-/·²³()%]+", "_", s)
    s = s.replace("(", "").replace(")", "").replace("/", "per").replace("-", "_")
    s = re.sub(r"_+", "_", s).strip("_")
    return s.lower()

def detect_label(df):
    # Ordered tuple (not a set) so the first match is deterministic
    exact = ("food", "food_product", "product", "item", "entity", "name")
    low = {c.lower(): c for c in df.columns}
    for k in exact:
        if k in low:
            return low[k]
    obj = df.select_dtypes(include="object").columns.tolist()
    return obj[0] if obj else df.columns[0]

def detect_cols(cols):
    cols = list(cols)

    def f(pat):
        # Columns whose name matches pat, case-insensitively
        return [c for c in cols if re.search(pat, c, re.I)]

    det = {}
    det["total_ghg_perkg"] = f(r"(total|overall).*(co2|co₂|ghg).*(perkg|\bkg\b)")
    if not det["total_ghg_perkg"]:
        # Fallback: any GHG-like per-kg column, excluding per-1000-kcal and per-100-g bases
        det["total_ghg_perkg"] = [
            c for c in f(r"(ghg|emission|co2|co₂).*(perkg|\bkg\b)")
            if not re.search(r"1000|100\s*g", c, re.I)
        ]
    # No bare "emission" here: it would pick the eutrophying-emissions column first
    det["total_ghg_per1000kcal"]   = f(r"(greenhouse_gas|ghg|co2|co₂).*1000.*kcal")[:1]
    det["total_ghg_per100g_protein"] = f(r"(greenhouse_gas|ghg|co2|co₂).*100\s*g.*protein")[:1]
    stage_pats = {
        "land_use_change": r"land\s*use\s*change",
        "animal_feed": r"animal\s*feed",
        "farm": r"\bfarm\b",
        "processing": r"processing",
        "transport": r"transport",
        "packaging": r"packaging",
        "retail": r"retail",
    }
    stage_cols = {}
    for k, pat in stage_pats.items():
        m = [c for c in cols if re.search(pat, c, re.I) and re.search(r"(co2|co₂|ghg|emission)", c, re.I)]
        stage_cols[k] = m[0] if m else None
    det["stage_cols"] = stage_cols
    det["freshwater_cols"] = f(r"freshwater\s*withdrawals")
    det["scarcity_weighted_cols"] = f(r"scarcity[_\-\s]*weighted\s*water")
    return det

df = pd.read_csv(DATA)
df = df.rename(columns={c: normalize_name(c) for c in df.columns})

LABEL = detect_label(df)
det   = detect_cols(df.columns)

# Numeric coercion
for c in df.columns:
    if c != LABEL:
        df[c] = pd.to_numeric(df[c], errors="coerce")

# Handle negatives: log then set to NaN
num = df.select_dtypes(include=[np.number])
neg_mask = num < 0
if neg_mask.values.any():
    neg_positions = neg_mask.stack()
    neg_positions = neg_positions[neg_positions].reset_index()
    neg_positions.columns = ["row_idx","column","is_negative"]
    neg_positions.to_csv("Food_Production_negative_values.csv", index=False)
    for col in neg_positions["column"].unique():
        df.loc[df[col] < 0, col] = np.nan

# Total GHG per kg
total_col = det["total_ghg_perkg"][0] if det["total_ghg_perkg"] else None
stage_cols_found = [c for c in det["stage_cols"].values() if c]
if total_col and total_col in df.columns and df[total_col].notna().any():
    df["ghg_total_perkg"] = df[total_col]
elif stage_cols_found:
    df["ghg_total_perkg"] = df[stage_cols_found].sum(axis=1, min_count=1)
else:
    df["ghg_total_perkg"] = np.nan

# Stage shares
if stage_cols_found:
    denom = df["ghg_total_perkg"].replace({0: np.nan})
    for c in stage_cols_found:
        df[f"{c}_share"] = df[c] / denom

# Food class
def tag_food_group(x):
    n = str(x).lower()
    animal = ["beef","lamb","mutton","pork","chicken","poultry","turkey","egg","fish","seafood","shrimp","prawn","cheese","milk","dairy","yogurt","butter"]
    plant  = ["soy","tofu","legume","bean","pea","lentil","nuts","nut","almond","cashew","peanut","cereal","grain","wheat","rice","maize","corn","barley","oat","vegetable","fruit","tomato","banana","potato","oilseed","rapeseed","sunflower","olive","cocoa","coffee","tea"]
    if any(k in n for k in animal): return "animal"
    if any(k in n for k in plant):  return "plant"
    return "unknown"
df["food_class"] = df[LABEL].astype(str).map(tag_food_group)

df.to_csv("Food_Production_clean.csv", index=False)
df.head(10)


Unnamed: 0,food_product,land_use_change,animal_feed,farm,processing,transport,packging,retail,total_emissions,eutrophying_emissions_per_1000kcal_gpo₄eq_per_1000kcal,eutrophying_emissions_per_kilogram_gpo₄eq_per_kilogram,eutrophying_emissions_per_100g_protein_gpo₄eq_per_100_grams_protein,freshwater_withdrawals_per_1000kcal_liters_per_1000kcal,freshwater_withdrawals_per_100g_protein_liters_per_100g_protein,freshwater_withdrawals_per_kilogram_liters_per_kilogram,greenhouse_gas_emissions_per_1000kcal_kgco₂eq_per_1000kcal,greenhouse_gas_emissions_per_100g_protein_kgco₂eq_per_100g_protein,land_use_per_1000kcal_m²_per_1000kcal,land_use_per_kilogram_m²_per_kilogram,land_use_per_100g_protein_m²_per_100g_protein,scarcity_weighted_water_use_per_kilogram_liters_per_kilogram,scarcity_weighted_water_use_per_100g_protein_liters_per_100g_protein,scarcity_weighted_water_use_per_1000kcal_liters_per_1000_kilocalories,ghg_total_perkg,food_class
0,Wheat & Rye (Bread),0.1,0.0,0.8,0.2,0.1,0.1,0.1,1.4,,,,,,,,,,,,,,,,plant
1,Maize (Meal),0.3,0.0,0.5,0.1,0.1,0.1,0.0,1.1,,,,,,,,,,,,,,,,plant
2,Barley (Beer),0.0,0.0,0.2,0.1,0.0,0.5,0.3,1.1,,,,,,,,,,,,,,,,plant
3,Oatmeal,0.0,0.0,1.4,0.0,0.1,0.1,0.0,1.6,4.281357,11.23,8.638462,183.911552,371.076923,482.4,0.945482,1.907692,2.897446,7.6,5.846154,18786.2,14450.92308,7162.104461,,plant
4,Rice,0.0,0.0,3.6,0.1,0.1,0.1,0.1,4.0,9.514379,35.07,49.394366,609.983722,3166.760563,2248.4,1.207271,6.267606,0.759631,2.8,3.943662,49576.3,69825.77465,13449.89148,,plant
5,Potatoes,0.0,0.0,0.2,0.0,0.1,0.0,0.0,0.3,4.754098,3.48,20.470588,80.737705,347.647059,59.1,0.628415,2.705882,1.202186,0.88,5.176471,2754.2,16201.17647,3762.568306,,plant
6,Cassava,0.6,0.0,0.2,0.0,0.1,0.0,0.0,0.9,0.708419,0.69,7.666667,,,0.0,1.355236,14.666667,1.858316,1.81,20.111111,0.0,,,,unknown
7,Cane Sugar,1.2,0.0,0.5,0.0,0.8,0.1,0.0,2.6,4.820513,16.92,,176.666667,,620.1,0.911681,,0.581197,2.04,,16438.6,,4683.361823,,unknown
8,Beet Sugar,0.0,0.0,0.5,0.2,0.6,0.1,0.0,1.4,1.541311,5.41,,62.022792,,217.7,0.51567,,0.521368,1.83,,9493.3,,2704.643875,,unknown
9,Other Pulses,0.0,0.0,1.1,0.0,0.1,0.4,0.0,1.6,5.008798,17.08,7.977581,,203.503036,435.7,0.524927,0.836058,4.565982,15.57,7.272303,22477.4,10498.55208,,,unknown
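
In the preview above, `ghg_total_perkg` is empty for every row and no `*_share` columns appear. The detection patterns expect a GHG token (`co2`, `ghg`, `emission`) in the same column name as the kg marker or the stage name, while the normalized names here are plain `farm`, `processing`, `total_emissions`, and so on, with packaging spelled `packging` in the file. A minimal sketch of an explicit-name alternative follows. It assumes the stage columns and `total_emissions` are kg CO₂-eq per kg product; `STAGE_COLS`, `TOTAL_COL`, and `add_total_and_shares` are illustrative names.

```python
import numpy as np
import pandas as pd

# Assumption: these normalized names match the header printed above,
# including the dataset's own "packging" spelling.
STAGE_COLS = ["land_use_change", "animal_feed", "farm", "processing",
              "transport", "packging", "retail"]
TOTAL_COL = "total_emissions"  # assumed to be kg CO2-eq per kg product

def add_total_and_shares(frame):
    """Return a copy with ghg_total_perkg and one <stage>_share column per stage."""
    out = frame.copy()
    stages = [c for c in STAGE_COLS if c in out.columns]
    if TOTAL_COL in out.columns:
        out["ghg_total_perkg"] = out[TOTAL_COL]
    else:
        # No reported total: sum the stages, keeping NaN when every stage is missing
        out["ghg_total_perkg"] = out[stages].sum(axis=1, min_count=1)
    denom = out["ghg_total_perkg"].replace({0: np.nan})  # avoid division by zero
    for c in stages:
        out[f"{c}_share"] = out[c] / denom
    return out

# Two rows copied from the preview above.
demo = pd.DataFrame({
    "food_product": ["Oatmeal", "Rice"],
    "land_use_change": [0.0, 0.0], "animal_feed": [0.0, 0.0],
    "farm": [1.4, 3.6], "processing": [0.0, 0.1], "transport": [0.1, 0.1],
    "packging": [0.1, 0.1], "retail": [0.0, 0.1], "total_emissions": [1.6, 4.0],
})
result = add_total_and_shares(demo)
print(result[["food_product", "ghg_total_perkg", "farm_share"]])
```

Listing the columns by name trades generality for predictability: the step fails visibly if the file's header changes, instead of silently producing an all-NaN column.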
