# Build Ingredient Co-occurrence Network

This notebook builds a network of ingredients based on how often they co-occur in recipes. The network will be used for:
1. Fusion recipe suggestions (combining ingredients from different cuisines)
2. Cuisine classification (predicting cuisine from ingredient lists)

## Overview

The network construction process:
1. Load encoded recipe data from preprocessing pipeline
2. Compute ingredient co-occurrence statistics
3. Build NetworkX graph with ingredients as nodes and co-occurrences as edges
4. Analyze network properties (centrality, communities, etc.)
5. Save network for use in inference tasks


In [1]:
# Setup: Add classifier_pipeline to path
import sys
from pathlib import Path

# Add classifier_pipeline directory to path
pipeline_root = Path.cwd().parent
if str(pipeline_root) not in sys.path:
    sys.path.insert(0, str(pipeline_root))

print(f"Pipeline root: {pipeline_root}")
print(f"Python path includes: {str(pipeline_root) in sys.path}")


Pipeline root: c:\Users\georg.DESKTOP-2FS9VF1\source\repos\699-capstone-team14\classifier_pipeline
Python path includes: True


## Step 1: Load Configuration


In [2]:
import yaml
from pathlib import Path

# Load configuration
config_path = Path("../config/network_config.yaml")

if config_path.exists():
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)
    
    print("Configuration loaded:")
    data_cfg = config.get('data', {})
    network_cfg = config.get('network', {})
    
    print(f"  Input path: {data_cfg.get('input_path', 'N/A')}")
    print(f"  Ingredients column: {data_cfg.get('ingredients_col', 'N/A')}")
    print(f"  Cuisine column: {data_cfg.get('cuisine_col', 'N/A')}")
    print(f"\n  Network parameters:")
    print(f"    - Min co-occurrence: {network_cfg.get('min_cooccurrence', 'N/A')}")
    print(f"    - Weight method: {network_cfg.get('weight_method', 'N/A')}")
    print(f"    - Min ingredient frequency: {network_cfg.get('min_ingredient_freq', 'N/A')}")
else:
    # Fail fast: later cells need data_cfg and network_cfg to be defined
    raise FileNotFoundError(f"Config file not found: {config_path}")


Configuration loaded:
  Input path: ../../preprocess_pipeline/data/combined_raw_datasets_with_cuisine_encoded.parquet
  Ingredients column: encoded_ingredients
  Cuisine column: cuisine_encoded

  Network parameters:
    - Min co-occurrence: 1
    - Weight method: frequency
    - Min ingredient frequency: 2
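
The cells in this notebook read each setting with `.get(...)` and an inline fallback. A minimal sketch of collecting those fallbacks in one place is shown below. The helper name `resolve_settings` and the `DEFAULTS` dict are hypothetical; the keys and fallback values are the ones that appear in this notebook's `.get(...)` calls.

```python
from copy import deepcopy

# Fallback values copied from the .get(...) calls used in this notebook.
DEFAULTS = {
    "data": {"ingredients_col": "encoded_ingredients"},
    "network": {
        "min_cooccurrence": 1,
        "weight_method": "frequency",
        "normalize_weights": True,
        "min_ingredient_freq": 2,
    },
}


def resolve_settings(config=None):
    """Return DEFAULTS overlaid with whatever sections and keys `config` provides."""
    resolved = deepcopy(DEFAULTS)  # never mutate the module-level defaults
    for section, values in (config or {}).items():
        resolved.setdefault(section, {}).update(values or {})
    return resolved


# Example: override a single network parameter, keep the rest at their defaults.
settings = resolve_settings({"network": {"min_ingredient_freq": 5}})
print(settings["network"])
```

With this shape, a cell can read `settings["network"]["min_cooccurrence"]` directly instead of repeating the fallback value at every call site.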


## Step 2: Load Recipe Data

Load the final output from the preprocessing pipeline.


In [3]:
import pandas as pd
from network.builder import NetworkBuilder

# Initialize network builder
builder = NetworkBuilder(
    min_cooccurrence=network_cfg.get('min_cooccurrence', 1),
    weight_method=network_cfg.get('weight_method', 'frequency'),
    normalize_weights=network_cfg.get('normalize_weights', True),
    min_ingredient_freq=network_cfg.get('min_ingredient_freq', 2),
)

# Load data
# Fallback path matches the input_path value in network_config.yaml
input_path = Path(data_cfg.get(
    'input_path',
    '../../preprocess_pipeline/data/combined_raw_datasets_with_cuisine_encoded.parquet',
))
df = builder.load_data(input_path, ingredients_col=data_cfg.get('ingredients_col', 'encoded_ingredients'))

print(f"\nDataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()



Dataset shape: (63689, 8)
Columns: ['Dataset_ID', 'index', 'ingredients', 'cuisine', 'inferred_ingredients', 'encoded_ingredients', 'cuisine_deduped', 'cuisine_encoded']

First few rows:


| | Dataset_ID | index | ingredients | cuisine | inferred_ingredients | encoded_ingredients | cuisine_deduped | cuisine_encoded |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | [buttermilk cornbread, sandwich bread, salt, b... | [Southern & Soul Food] | [buttermilk cornbread, rice, salt water, peppe... | [16658, 155, 1456, 11, 506, 73, 23, 89, 6, 5, ... | [[southern & soul food]] | [0] |
| 1 | 1 | 1 | [Country Crock® Spread, light corn syrup, crea... | [American] | [[, country, crock, spread, corn syrup, peanut... | [0, 533, 13950, 803, 162, 19, 1082, 378] | [[american]] | [0] |
| 2 | 1 | 2 | [Skippy® Super Chunk® Peanut Butter, Country C... | [American] | [super chunk, ®, peanut, butter, country, croc... | [0, 0, 702, 6, 533, 13950, 803, 1082, 729] | [[american]] | [0] |
| 3 | 1 | 3 | [light mayonnaise, lemon juice, cayenne pepper... | [American] | [mayonnaise, lemon juice, cayenne pepper, blue... | [66, 56, 345, 1760, 458, 22, 16] | [[american]] | [0] |
| 4 | 1 | 4 | [elbow macaroni, hellmann' or best food real m... | [American] | [elbow, hellmann, red vinegar, wine, dijonnais... | [612, 49176, 103, 212, 49176, 3073, 3339, 1459... | [[american]] | [0] |


## Step 2.5: Diagnostic - Inspect Ingredients Data

Before building the network, let's check the format and content of the ingredients column.


In [4]:
import numpy as np
from collections import Counter

# Diagnostic: check the ingredients column
ingredients_col = data_cfg.get('ingredients_col', 'encoded_ingredients')
print(f"Diagnostic: Checking '{ingredients_col}' column...")

# Count total non-zero ingredients across all rows
print(f"\n Counting non-zero ingredients across all rows...")
nonzero_count = 0
rows_with_ingredients = 0

# First pass: count non-zero IDs per row, skipping rows that are not list-like
for idx, row in df.iterrows():
    val = row[ingredients_col]
    
    # Check for the expected type *first*.
    # This will safely skip np.nan, None, strings, etc.
    if isinstance(val, (list, tuple, np.ndarray)):
        ing_list = list(val)
        non_zero = [i for i in ing_list if i and i != 0]
        if len(non_zero) > 0:
            rows_with_ingredients += 1
        nonzero_count += len(non_zero)

print(f" Rows with non-zero ingredients: {rows_with_ingredients:,} / {len(df):,}")
print(f" Total non-zero ingredient IDs found: {nonzero_count:,}")

# Check ingredient frequency distribution
print(f"\n Computing ingredient frequency (this may take a moment)...")
ingredient_counter = Counter()

# Second pass: tally how often each non-zero ingredient ID occurs
for idx, row in df.iterrows():
    val = row[ingredients_col]
    
    # Apply the same logic: check for type *first*.
    if isinstance(val, (list, tuple, np.ndarray)):
        for ing_id in val:
            if ing_id and ing_id != 0:
                ingredient_counter[ing_id] += 1

print(f" Unique non-zero ingredient IDs: {len(ingredient_counter):,}")
if len(ingredient_counter) > 0:
    freq_dist = Counter(ingredient_counter.values())
    print(f" Frequency distribution:")
    for freq, count in sorted(freq_dist.items())[:10]:
        print(f" Ingredients appearing {freq} time(s): {count:,}")
        
    # Compare against the frequency threshold configured for the network
    min_freq = network_cfg.get('min_ingredient_freq', 2)
    print(f" Ingredients appearing >= {min_freq} times: {sum(1 for f in ingredient_counter.values() if f >= min_freq):,}")

Diagnostic: Checking 'encoded_ingredients' column...

 Counting non-zero ingredients across all rows...
 Rows with non-zero ingredients: 61,521 / 63,689
 Total non-zero ingredient IDs found: 649,479

 Computing ingredient frequency (this may take a moment)...
 Unique non-zero ingredient IDs: 9,415
 Frequency distribution:
 Ingredients appearing 1 time(s): 2,566
 Ingredients appearing 2 time(s): 1,682
 Ingredients appearing 3 time(s): 769
 Ingredients appearing 4 time(s): 525
 Ingredients appearing 5 time(s): 367
 Ingredients appearing 6 time(s): 299
 Ingredients appearing 7 time(s): 222
 Ingredients appearing 8 time(s): 214
 Ingredients appearing 9 time(s): 157
 Ingredients appearing 10 time(s): 135
 Ingredients appearing >= 2 times: 6,849
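
The figures above are internally consistent: 9,415 unique IDs minus the 2,566 that appear only once leaves 6,849, which matches the last line. A minimal sketch of the same count-then-filter logic on toy data is shown below. `count_ingredients` and `frequent_ingredients` are hypothetical helper names, ID 0 is treated as padding as in the cell above, and the sketch accepts only lists and tuples (the notebook cell also accepts NumPy arrays).

```python
from collections import Counter


def count_ingredients(recipes):
    """Count occurrences of each non-zero ingredient ID across all recipes."""
    counter = Counter()
    for ids in recipes:
        if isinstance(ids, (list, tuple)):  # skip None and other malformed rows
            counter.update(i for i in ids if i)  # drops 0 (padding) and None
    return counter


def frequent_ingredients(counter, min_freq=2):
    """Return the set of ingredient IDs seen at least `min_freq` times."""
    return {ing for ing, freq in counter.items() if freq >= min_freq}


toy_recipes = [[1456, 11, 0], [1456, 6], None, [11, 1456, 23]]
counts = count_ingredients(toy_recipes)
kept = frequent_ingredients(counts, min_freq=2)
print(sorted(kept))  # [11, 1456]

# The same arithmetic applied to the totals reported above.
survivors = 9415 - 2566
print(survivors)  # 6849
```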


## Step 3: Build Network

Compute co-occurrence statistics and build the ingredient network graph.

**Note**: If the network has 0 nodes, check the diagnostic output above. You may need to:
- Lower `min_ingredient_freq` in the config (the current value of 2 filters out ingredients that appear only once)
- Check if the ingredients column is being read correctly
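
The internals of `NetworkBuilder` are not shown in this notebook, so the following is only a sketch of how pairwise co-occurrence counts could be derived under the two thresholds from the config. The function name `cooccurrence_edges` is hypothetical, and counting each ingredient once per recipe is an assumption; the builder may count differently.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(recipes, min_ingredient_freq=2, min_cooccurrence=1):
    """Return {(a, b): count} for ingredient pairs that share a recipe (a < b)."""
    freq = Counter()
    cleaned = []
    for ids in recipes:
        # Drop padding ID 0 and duplicates; sort so pairs come out ordered.
        unique = sorted({i for i in ids if i})
        cleaned.append(unique)
        freq.update(unique)

    # Ingredients that are frequent enough to become nodes.
    keep = {i for i, n in freq.items() if n >= min_ingredient_freq}

    pairs = Counter()
    for unique in cleaned:
        kept = [i for i in unique if i in keep]
        pairs.update(combinations(kept, 2))

    # Pairs that are frequent enough to become edges.
    return {pair: n for pair, n in pairs.items() if n >= min_cooccurrence}


toy = [[1456, 11, 6], [1456, 11, 0], [6, 1456, 99]]
edges = cooccurrence_edges(toy)
print(edges)
```

In the toy data, ingredient 99 appears in only one recipe, so it is dropped before any pairs are formed and contributes no edges.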


In [5]:
# Compute statistics
builder.compute_statistics(df, ingredients_col=data_cfg.get('ingredients_col', 'encoded_ingredients'))

# Build graph
graph = builder.build_graph()

# Normalize weights if needed
if network_cfg.get('normalize_weights', True):
    graph = builder.normalize_graph_weights(graph)
    builder.graph = graph  # Update builder's graph reference

print(f"\nNetwork built:")
print(f"  Nodes (ingredients): {graph.number_of_nodes():,}")
print(f"  Edges (co-occurrences): {graph.number_of_edges():,}")



Network built:
  Nodes (ingredients): 6,849
  Edges (co-occurrences): 438,742
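
The cell above calls `builder.normalize_graph_weights`, whose method is not shown here. One common choice, given purely as an illustration and not as the builder's actual behavior, is to divide every raw co-occurrence count by the largest one so that weights fall in (0, 1]. The name `normalize_by_max` is hypothetical.

```python
def normalize_by_max(edge_weights):
    """Scale raw co-occurrence counts into (0, 1] by dividing by the maximum."""
    if not edge_weights:
        return {}
    top = max(edge_weights.values())
    return {pair: weight / top for pair, weight in edge_weights.items()}


raw = {(6, 11): 1, (6, 1456): 2, (11, 1456): 4}
normalized = normalize_by_max(raw)
print(normalized)
```

Max-scaling preserves the ranking of edges while making weights comparable across networks built from datasets of different sizes.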


## Step 4: Analyze Network

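Analyze network properties and identify important ingredients. The analyzer skips path-length statistics on a graph of this size, so the `diameter` and `average_path_length` values of 0 in the output are placeholders, not measured results.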


In [6]:
from network.analysis import NetworkAnalyzer
from network.graph import IngredientGraph

# Wrap graph in IngredientGraph for easier access
ing_graph = IngredientGraph(graph)

# Analyze network
analyzer = NetworkAnalyzer(graph)

# Get network statistics
stats = analyzer.get_network_statistics()
print("Network Statistics:")
for key, value in stats.items():
    print(f"  {key}: {value}")

# Compute centrality measures
centralities = analyzer.compute_centrality_measures()

# Get top ingredients by degree
top_ingredients = analyzer.get_top_ingredients('degree', top_k=10)
print(f"\nTop 10 ingredients by degree centrality:")
for ing_id, score in top_ingredients:
    print(f"  Ingredient {ing_id}: {score:.4f}")


Path-length statistics (diameter, average_path_length) skipped for performance. Graph has 6,849 nodes. These calculations require O(n^2) all-pairs shortest paths and can hang on large graphs.


Network Statistics:
  num_nodes: 6849
  num_edges: 438742
  density: 0.018708901497319345
  is_connected: False
  largest_component_size: 6842
  diameter: 0
  average_path_length: 0.0
  avg_degree: 128.11855745364286
  max_degree: 5437
  min_degree: 0

Top 10 ingredients by degree centrality:
  Ingredient 1456: 0.7940
  Ingredient 11: 0.6650
  Ingredient 1082: 0.6202
  Ingredient 179: 0.6189
  Ingredient 6: 0.5869
  Ingredient 23: 0.5853
  Ingredient 158: 0.5605
  Ingredient 5: 0.5483
  Ingredient 28: 0.5409
  Ingredient 9105: 0.5190
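
The reported statistics can be cross-checked against the standard definitions for an undirected simple graph with n nodes and m edges: density = 2m / (n(n-1)), average degree = 2m / n, and degree centrality = degree / (n-1). A short sketch using the counts printed above is shown below. The function names are illustrative, and pairing `max_degree` 5437 with the top-ranked ingredient 1456 is an inference from the output, not something the analyzer states.

```python
def density(num_nodes, num_edges):
    """Fraction of all possible undirected edges that are present."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


def average_degree(num_nodes, num_edges):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2 * num_edges / num_nodes


def degree_centrality(degree, num_nodes):
    """Degree normalized by the maximum possible degree, n - 1."""
    return degree / (num_nodes - 1)


n, m = 6849, 438742
graph_density = density(n, m)
graph_avg_degree = average_degree(n, m)
top_centrality = degree_centrality(5437, n)

print(f"density:        {graph_density:.6f}")
print(f"avg_degree:     {graph_avg_degree:.4f}")
print(f"top centrality: {top_centrality:.4f}")
```

The top score of 0.7940 is consistent with 5437 / 6848: the highest-degree ingredient co-occurs with roughly 79% of all other ingredients, while the average ingredient is linked to about 128 others.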


## Step 5: Save Network

Save the network for use in inference tasks.


In [7]:
# Save graph
graph_output = Path(network_cfg.get('graph_output', './data/ingredient_network.graphml'))
builder.save_graph(graph_output, format='graphml')

print(f"\nNetwork saved to: {graph_output}")
print("Network is ready for use in inference tasks!")



Network saved to: data\ingredient_network.graphml
Network is ready for use in inference tasks!
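
When the saved file is loaded later for inference, note that GraphML stores node IDs as strings. The sketch below shows a save/load round trip using NetworkX directly, assuming `builder.save_graph(..., format='graphml')` writes a standard GraphML file (its internals are not shown in this notebook). `roundtrip_graphml` is a hypothetical helper; passing `node_type=int` to `nx.read_graphml` restores integer ingredient IDs.

```python
import tempfile
from pathlib import Path

import networkx as nx


def roundtrip_graphml(graph, path):
    """Write `graph` to GraphML and read it back with integer node IDs."""
    nx.write_graphml(graph, str(path))
    # Without node_type=int, node IDs would come back as strings ("1456").
    return nx.read_graphml(str(path), node_type=int)


# Toy graph with integer ingredient IDs and float edge weights.
g = nx.Graph()
g.add_edge(1456, 11, weight=0.5)
g.add_edge(11, 6, weight=0.25)

with tempfile.TemporaryDirectory() as tmp:
    loaded = roundtrip_graphml(g, Path(tmp) / "ingredient_network.graphml")

print(sorted(loaded.nodes))
print(loaded[1456][11]["weight"])
```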
