# Generate Noun Property Ratings (Category-Based)

This notebook prompts `microsoft/Phi-3-mini-4k-instruct` to rate every noun on each property defined for its category (1-10 scale) and saves the results as JSONL plus a nested JSON file.

**Input:**
- `categories_tree.json` - Nouns organized by category
- `category_properties.json` - Properties for each category

**Output:**
- `grouped_noun_property_ratings.jsonl` - JSONL format (one JSON object per line)
- `noun_property_nested.json` - Nested JSON format (nouns with property objects)
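
The loading code in section 4 reads `categories_data['Categories']` and `properties_data[category]['properties']`, and the rating function expects each property to carry `name`, `description`, and `scale`. A minimal sketch of that shape follows. The sample nouns and properties are made up, and `validate_inputs` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical miniature versions of the two input files (sample data is made up).
categories_data = {
    "Categories": {
        "Fruits": ["apple", "banana"],
        "Trees": ["oak"],
    }
}
properties_data = {
    "Fruits": {"properties": [
        {"name": "sweetness",
         "description": "How sweet the fruit tastes",
         "scale": "1 = not sweet, 10 = very sweet"},
    ]},
    "Trees": {"properties": [
        {"name": "height",
         "description": "Typical mature height",
         "scale": "1 = very short, 10 = very tall"},
    ]},
}


def validate_inputs(categories_data, properties_data):
    """Return a list of problems; an empty list means the two files line up."""
    problems = []
    for category in categories_data["Categories"]:
        if category not in properties_data:
            problems.append(f"no properties for category: {category}")
            continue
        for prop in properties_data[category]["properties"]:
            missing = {"name", "description", "scale"} - prop.keys()
            if missing:
                problems.append(f"{category}: property missing {sorted(missing)}")
    return problems


problems = validate_inputs(categories_data, properties_data)
print(problems)
```

Running a check like this before loading the model can catch a mismatched category name early, rather than as a `KeyError` partway through a long run.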

## 1. Setup

In [1]:
import torch
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from tqdm import tqdm
import os
import re
import json



## 2. Configuration

In [2]:
# File paths
CATEGORIES_FILE = "categories_tree.json"
PROPERTIES_FILE = "category_properties.json"
OUTPUT_FILE = "grouped_noun_property_ratings.jsonl"  # JSONL format (one JSON object per line)

# How often to save progress
SAVE_FREQUENCY = 200

## 3. Load Model

In [3]:
print("Loading model...")

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
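    device_map="mps",  # "mps" for Apple Silicon; use "cuda" for an NVIDIA GPU or "cpu" for CPU-only
    torch_dtype="auto",  # the run below warns that `dtype` replaces `torch_dtype` in this transformers version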
    trust_remote_code=False,
)

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=50,
    do_sample=False,
)

print("Model loaded!")

Loading model...


`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Device set to use mps
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Model loaded!


## 4. Load Categories and Properties

In [4]:
# Load categories tree
with open(CATEGORIES_FILE, 'r') as f:
    categories_data = json.load(f)

# Load properties for each category
with open(PROPERTIES_FILE, 'r') as f:
    properties_data = json.load(f)

# Extract noun-to-category mapping
noun_to_category = {}
category_to_nouns = categories_data['Categories']

for category, nouns in category_to_nouns.items():
    for noun in nouns:
        noun_to_category[noun] = category

# Get all unique nouns
all_nouns = list(noun_to_category.keys())

print(f"Loaded {len(all_nouns)} nouns across {len(category_to_nouns)} categories")
print("\nCategories:")
for category, nouns in category_to_nouns.items():
    prop_count = len(properties_data[category]['properties'])
    print(f"  {category}: {len(nouns)} nouns, {prop_count} properties")

# Calculate total ratings needed
total_ratings = sum(
    #len(nouns) * len(properties_data['category_properties'][category]['properties'])
    len(nouns) * len(properties_data[category]['properties'])
    for category, nouns in category_to_nouns.items()
)

print(f"\nTotal ratings needed: {total_ratings:,}")
print(f"Estimated time: {total_ratings * 2 / 3600:.1f} hours (at 2 sec/rating)")

Loaded 1852 nouns across 20 categories

Categories:
  Domesticated Mammals: 33 nouns, 10 properties
  Wild Mammals: 72 nouns, 10 properties
  Reptiles & Amphibians: 51 nouns, 10 properties
  Birds, Fish, Insects & Other Animals: 44 nouns, 10 properties
  Trees: 12 nouns, 10 properties
  Non-Tree Plants: 25 nouns, 10 properties
  Fruits: 27 nouns, 10 properties
  Other Raw Foods: 16 nouns, 10 properties
  Prepared Foods & Meals: 51 nouns, 10 properties
  Aircraft: 16 nouns, 10 properties
  Motorcycles: 37 nouns, 10 properties
  Cars, Trucks, Boats & Other Motorized Vehicles: 65 nouns, 10 properties
  Non-Motorized Transportation: 11 nouns, 10 properties
  Geological Features & Minerals: 66 nouns, 10 properties
  Natural Non-Geological Features: 60 nouns, 10 properties
  Buildings & Large Structures: 28 nouns, 10 properties
  Handheld Objects & Tools: 78 nouns, 10 properties
  Transportation Infrastructure: 19 nouns, 10 properties
  Abstract Concepts, Places, People & Activities: 996 nou

## 5. Rating Function

In [5]:
def rate_noun_on_property(noun, property_info, pipe):
    """
    Rate a noun on a property from 1-10.
    Optimized for microsoft/Phi-3-mini-4k-instruct.
    Returns -1 if rating extraction fails.
    """
    property_name = property_info['name']
    description = property_info['description']
    scale = property_info['scale']
    
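    # Extract scale endpoints. Scales in the properties file look like
    # "1 = very small (...), 5 = medium (...), 10 = very large (...)", so match
    # the "1 = ..." and "10 = ..." anchors first; the "X to Y" split is kept
    # as a fallback for scales written that way.
    low_match = re.search(r'\b1\s*=\s*(.+?)(?=,\s*\d+\s*=|\s+to\s+10\b|$)', scale)
    high_match = re.search(r'\b10\s*=\s*(.+)$', scale)
    scale_parts = scale.split(' to ')
    if low_match and high_match:
        low_end = low_match.group(1).strip()
        high_end = high_match.group(1).strip()
    elif len(scale_parts) == 2:
        low_end = scale_parts[0].replace('1 ', '').strip('()')
        high_end = scale_parts[1].replace('10 ', '').strip('()')
    else:
        low_end = "minimum"
        high_end = "maximum"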
    
    # Phi-3-mini optimized prompt: short, direct, numbered format
    prompt = f"""Rate "{noun}" for: {property_name}
{description}

Scale:
1={low_end}
10={high_end}

Output only the number 1-10."""
    
    messages = [{"role": "user", "content": prompt}]
    
    try:
        output = pipe(messages)
        response = output[0]["generated_text"].strip()
        
        # Extract first valid number from response
        numbers = re.findall(r'\b([1-9]|10)\b', response)
        if numbers:
            rating = int(numbers[0])
            if 1 <= rating <= 10:
                return rating
        
        return -1
    except Exception as e:
        print(f"Error: {noun} - {property_name}: {str(e)}")
        return -1
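
The number-extraction step can be tried on its own, without loading the model. The sketch below isolates it in a hypothetical `extract_rating` helper (the name and the sample responses are assumptions for illustration); it applies the same idea as the function above, taking the first standalone integer between 1 and 10.

```python
import re


def extract_rating(response):
    """Return the first standalone integer 1-10 in the text, or -1 if none."""
    match = re.search(r'\b(10|[1-9])\b', response)
    return int(match.group(1)) if match else -1


# Made-up model responses covering the common cases
for text in ["7", "Rating: 10/10", "I would say 3.", "unsure", "15"]:
    print(repr(text), "->", extract_rating(text))
```

Note that a decimal answer such as "3.5" would be read as 3, because the period counts as a word boundary; out-of-range integers such as "15" yield -1.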

## 6. Test Rating Function

In [6]:
# Quick test: rate one noun from "Domesticated Mammals" on the category's first property
print("Testing...")
test_category = "Domesticated Mammals"
test_noun = "elephant" if "elephant" in category_to_nouns.get(test_category, []) else category_to_nouns[test_category][0]
#test_property = properties_data['category_properties'][test_category]['properties'][0]
test_property = properties_data[test_category]['properties'][0]

test_rating = rate_noun_on_property(test_noun, test_property, pipe)
print(f"{test_noun} - {test_property['name']}: {test_rating}/10")
print(f"Property description: {test_property['description']}")
print(f"Scale: {test_property['scale']}")

Testing...
airedale terrier - size: 7/10
Property description: Physical size of the animal
Scale: 1 = very small (hamster, guinea pig), 5 = medium (cat, small dog), 10 = very large (cow, horse)


## 7. Generate All Ratings

In [7]:
# Check for existing progress
if os.path.exists(OUTPUT_FILE):
    existing_ratings = []
    with open(OUTPUT_FILE, 'r') as f:
        for line in f:
            existing_ratings.append(json.loads(line))
    completed = set((r['noun'], r['property']) for r in existing_ratings)
    print(f"Found {len(existing_ratings)} existing ratings")
else:
    existing_ratings = []
    completed = set()
    print("Starting fresh")

# Create list of all work (noun, category, property_info)
all_combinations = []
for category, nouns in category_to_nouns.items():
    #properties = properties_data['category_properties'][category]['properties']
    properties = properties_data[category]['properties']
    for noun in nouns:
        for prop_info in properties:
            all_combinations.append((noun, category, prop_info))

# Filter out completed work
remaining = [
    combo for combo in all_combinations 
    if (combo[0], combo[2]['name']) not in completed
]

print(f"Remaining: {len(remaining):,} ratings")
print("\nStarting...\n")

Found 2000 existing ratings
Remaining: 16,520 ratings

Starting...
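
The resume step depends on every line of the JSONL file being complete. If a run is killed mid-write, the last line may be truncated and `json.loads` would raise. The following is a sketch of a more tolerant loader, not the notebook's own code: `load_completed` is a hypothetical helper, and the demo file and records are made up.

```python
import json
import os
import tempfile


def load_completed(path):
    """Return the set of (noun, property) pairs already present in a JSONL file."""
    completed = set()
    if os.path.exists(path):
        with open(path, "r") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip a partially written line
                completed.add((record["noun"], record["property"]))
    return completed


# Demo with a throwaway file: one complete record, one interrupted write
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_ratings.jsonl")
    with open(demo_path, "w") as f:
        f.write(json.dumps({"noun": "apple", "category": "Fruits",
                            "property": "sweetness", "rating": 7}) + "\n")
        f.write('{"noun": "banana", "prop')  # simulated truncated line
    completed = load_completed(demo_path)

work = [("apple", "sweetness"), ("banana", "sweetness")]
remaining = [pair for pair in work if pair not in completed]
print(remaining)
```

Skipped lines are simply treated as not done, so the corresponding ratings are generated again on the next run.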



In [8]:
# Main loop
results = []

for i, (noun, category, prop_info) in enumerate(tqdm(remaining, desc="Rating")):
    # Get rating
    rating = rate_noun_on_property(noun, prop_info, pipe)
    
    # Store
    results.append({
        'noun': noun,
        'category': category,
        'property': prop_info['name'],
        'rating': rating
    })
    
    # Save periodically
    if (i + 1) % SAVE_FREQUENCY == 0:
        # Append new results to file
        with open(OUTPUT_FILE, 'a') as f:
            for result in results:
                f.write(json.dumps(result) + '\n')
        results = []
        print(f"Saved at {i + 1} ratings")

# Final save
if results:
    with open(OUTPUT_FILE, 'a') as f:
        for result in results:
            f.write(json.dumps(result) + '\n')

print(f"\nDone! Saved to {OUTPUT_FILE}")

Rating:   1%|          | 200/16520 [01:22<1:51:51,  2.43it/s]

Saved at 200 ratings


Rating: 100%|██████████| 16520/16520 [2:22:53<00:00,  1.93it/s]


Done! Saved to grouped_noun_property_ratings.jsonl
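
The section 4 estimate assumes 2 seconds per rating, while the progress bar above finished 16,520 ratings in 2:22:53. As a rough sketch, the observed throughput can be turned into a projection for a full run of 18,520 ratings (`hms_to_seconds` is a hypothetical helper; the figures are taken from the outputs above):

```python
def hms_to_seconds(hms):
    """Convert an 'H:MM:SS' string to a number of seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


elapsed = hms_to_seconds("2:22:53")  # wall time reported by tqdm
done = 16520                         # ratings completed in that time
rate = done / elapsed                # ratings per second

print(f"Observed throughput: {rate:.2f} ratings/sec")
print(f"Projected full run (18,520 ratings): {18520 / rate / 3600:.1f} hours")
print(f"Fixed 2 sec/rating estimate: {18520 * 2 / 3600:.1f} hours")
```

On this hardware the fixed 2 sec/rating figure overestimates by roughly a factor of four, so a short timed sample is likely a better basis for planning.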





## 8. View Results

In [9]:
# Load and display results
ratings = []
with open(OUTPUT_FILE, 'r') as f:
    for line in f:
        ratings.append(json.loads(line))

print(f"Total ratings: {len(ratings):,}")
failed = sum(1 for r in ratings if r['rating'] == -1)
print(f"Failed ratings: {failed}")
print(f"Success rate: {(len(ratings) - failed) / len(ratings) * 100:.1f}%")

print("\nFirst 20 ratings:")
for rating in ratings[:20]:
    print(rating)

Total ratings: 18,520
Failed ratings: 6
Success rate: 100.0%

First 20 ratings:
{'noun': 'airedale terrier', 'category': 'Domesticated Mammals', 'property': 'size', 'rating': 7}
{'noun': 'airedale terrier', 'category': 'Domesticated Mammals', 'property': 'commonness_as_pet', 'rating': 8}
{'noun': 'airedale terrier', 'category': 'Domesticated Mammals', 'property': 'energy_level', 'rating': 8}
{'noun': 'airedale terrier', 'category': 'Domesticated Mammals', 'property': 'grooming_needs', 'rating': 7}
{'noun': 'airedale terrier', 'category': 'Domesticated Mammals', 'property': 'intelligence_trainability', 'rating': 8}
{'noun': 'airedale terrier', 'category': 'Domesticated Mammals', 'property': 'vocalization_level', 'rating': 7}
{'noun': 'airedale terrier', 'category': 'Domesticated Mammals', 'property': 'lifespan', 'rating': 8}
{'noun': 'airedale terrier', 'category': 'Domesticated Mammals', 'property': 'indoor_suitability', 'rating': 8}
{'noun': 'airedale terrier', 'category': 'Domesticat
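
The summary above reports 6 failed ratings alongside a success rate of 100.0%. Both are correct: the rate is rounded to one decimal place. A quick check with the totals from that output:

```python
total = 18520
failed = 6

success_rate = (total - failed) / total * 100

print(f"{success_rate:.1f}%")  # one decimal, as in the cell above
print(f"{success_rate:.2f}%")  # two decimals makes the failures visible
```

With two decimals the figure reads 99.97%, which is less likely to be mistaken for a run with no failures.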

In [10]:
# Create nested format (optional - easier to work with)
# Structure: { noun: { category: str, properties: { property: rating, ... } }, ... }
nested = {}
for rating in ratings:
    if rating['rating'] != -1:  # Skip failed ratings
        noun = rating['noun']
        if noun not in nested:
            nested[noun] = {
                'category': rating['category'],
                'properties': {}
            }
        nested[noun]['properties'][rating['property']] = rating['rating']

with open('noun_property_nested.json', 'w') as f:
    json.dump(nested, f, indent=2)

print("Also saved nested format to: noun_property_nested.json")
print(f"Structure: {len(nested)} nouns with category and properties as nested objects")

Also saved nested format to: noun_property_nested.json
Structure: 1852 nouns with category and properties as nested objects
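
For reference, here is a small sketch of how the nested structure behaves on a few records. The first two records come from the output in section 8; the third is a made-up failed rating, and `build_nested` and `get_rating` are hypothetical helper names.

```python
flat = [
    {"noun": "airedale terrier", "category": "Domesticated Mammals",
     "property": "size", "rating": 7},
    {"noun": "airedale terrier", "category": "Domesticated Mammals",
     "property": "energy_level", "rating": 8},
    # Made-up record standing in for a failed extraction
    {"noun": "example noun", "category": "Example Category",
     "property": "example_property", "rating": -1},
]


def build_nested(records):
    """Group flat rating records by noun, dropping failed (-1) ratings."""
    nested = {}
    for r in records:
        if r["rating"] == -1:
            continue
        entry = nested.setdefault(
            r["noun"], {"category": r["category"], "properties": {}}
        )
        entry["properties"][r["property"]] = r["rating"]
    return nested


def get_rating(nested, noun, prop):
    """Look up one rating; returns None if the noun or property is absent."""
    return nested.get(noun, {}).get("properties", {}).get(prop)


nested_demo = build_nested(flat)
print(get_rating(nested_demo, "airedale terrier", "size"))
print(get_rating(nested_demo, "example noun", "example_property"))
```

A noun whose ratings all failed never gets an entry, so downstream code should treat a missing noun or property as "no rating" rather than assume every noun has all of its properties.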


## 9. Summary Statistics by Category

In [11]:
# Show statistics by category
from collections import defaultdict

category_stats = defaultdict(lambda: {'total': 0, 'failed': 0})

for rating in ratings:
    cat = rating['category']
    category_stats[cat]['total'] += 1
    if rating['rating'] == -1:
        category_stats[cat]['failed'] += 1

print("\nStatistics by Category:")
# Column 1 is 48 wide so the longest category names (46 characters) fit
print(f"{'Category':<48} {'Total':>7} {'Failed':>7} {'Success Rate':>13}")
print("=" * 78)

for cat in sorted(category_stats.keys()):
    stats = category_stats[cat]
    success_rate = (stats['total'] - stats['failed']) / stats['total'] * 100
    # Right-align the number and keep the '%' directly after it
    print(f"{cat:<48} {stats['total']:>7} {stats['failed']:>7} {success_rate:>12.1f}%")


Statistics by Category:
Category                                           Total  Failed  Success Rate
==============================================================================
Abstract Concepts, Places, People & Activities      9960       2        100.0%
Aircraft                                             160       0        100.0%
Birds, Fish, Insects & Other Animals                 440       0        100.0%
Buildings & Large Structures                         280       0        100.0%
Cars, Trucks, Boats & Other Motorized Vehicles       650       0        100.0%
Domesticated Mammals                                 330       3         99.1%
Fruits                                               270       0        100.0%
Geological Features & Minerals                       660       0        100.0%
Handheld Objects & Tools                             780       0        100.0%
Motorcycles                                          370       1         99.7%
Natural Non-Geological Features