# Real Estate Appraisal System - Data Processing Notebook


## Introduction
This notebook is the first step toward an appraisal system that finds the three properties most similar to a subject property. We start by loading the data and understanding its structure.


## Cell 1: Import Required Libraries


In [2]:
# Import the libraries we need
import json          # To read JSON files
import pandas as pd  # To work with data in table format
import numpy as np   # For numerical operations

print("Libraries imported successfully!")

Libraries imported successfully!


## Cell 2: Load the JSON Data

First, let's create a function to load our JSON file. JSON (JavaScript Object Notation) is a common format for storing structured data.


In [9]:
def load_json_file(filename):
    """
    This function reads a JSON file and returns its contents
    
    Parameters:
    - filename: path to your JSON file (e.g., 'data.json')
    
    Returns:
    - data: the JSON content as a Python dictionary
    """
    # Open the file in read mode (UTF-8 is assumed, which is the usual encoding for JSON)
    with open(filename, 'r', encoding='utf-8') as file:
        # Load the JSON content
        data = json.load(file)
    
    print(f"✅ Successfully loaded {filename}")
    return data

# Load your JSON file (replace with your actual filename)
data = load_json_file('appraisals_dataset.json')

✅ Successfully loaded appraisals_dataset.json
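
If the file is missing or malformed, `load_json_file` stops with a raw exception. A minimal sketch of a more defensive loader is shown below. `load_json_file_safe` is a hypothetical helper, not part of the original notebook, and it assumes the file is UTF-8 encoded.

```python
import json
from pathlib import Path


def load_json_file_safe(filename):
    """Hypothetical variant of load_json_file that returns None on failure."""
    path = Path(filename)
    if not path.is_file():
        print(f"File not found: {path}")
        return None
    try:
        # Assumption: the dataset is UTF-8 encoded
        with path.open("r", encoding="utf-8") as file:
            return json.load(file)
    except json.JSONDecodeError as err:
        # err.lineno and err.colno locate the position where parsing failed
        print(f"Invalid JSON in {path} (line {err.lineno}, column {err.colno}): {err.msg}")
        return None
```

Returning `None` keeps the notebook running, so any later cell should check the result before using it.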


## Cell 3: Understand the JSON Structure

In [10]:
def explore_json_structure(data):
    """
    This function shows us what's inside the JSON file
    """
    print("🔍 JSON Structure:")
    print(f"   - Main keys: {list(data.keys())}")
    
    # Check appraisals section
    if 'appraisals' in data:
        print("\n📊 Appraisals section:")
        print(f"   - Number of appraisals: {len(data['appraisals'])}")
        if len(data['appraisals']) > 0:
            print(f"   - Keys in first appraisal: {list(data['appraisals'][0].keys())}")
    
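    # Check properties section (only runs if 'properties' is a top-level key;
    # in this dataset it is nested inside each appraisal, so nothing is printed)
    if 'properties' in data:
        print("\n🏠 Properties section:")
        print(f"   - Number of candidate properties: {len(data['properties'])}")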

# Run this to explore your data
explore_json_structure(data)

🔍 JSON Structure:
   - Main keys: ['appraisals']

📊 Appraisals section:
   - Number of appraisals: 88
   - Keys in first appraisal: ['orderID', 'subject', 'comps', 'properties']
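
The output shows that `properties` is nested inside each appraisal rather than sitting at the top level, so the properties branch of `explore_json_structure` prints nothing. A small recursive helper can reveal nesting like this. `summarize_structure` and the sample record below are illustrative only, not part of the original notebook.

```python
def summarize_structure(obj, max_depth=3, depth=0):
    """Return lines describing the keys and container types of nested JSON data."""
    indent = "   " * depth
    lines = []
    if depth >= max_depth:
        return lines
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{indent}- {key}: {type(value).__name__}")
            lines.extend(summarize_structure(value, max_depth, depth + 1))
    elif isinstance(obj, list) and obj:
        # Only the first element is inspected; this assumes list items share a shape
        lines.append(f"{indent}- [{len(obj)} items], first item: {type(obj[0]).__name__}")
        lines.extend(summarize_structure(obj[0], max_depth, depth + 1))
    return lines


# Made-up record for illustration only
sample = {"appraisals": [{"orderID": "1", "subject": {}, "comps": [], "properties": []}]}
print("\n".join(summarize_structure(sample)))
```

With the real data, `summarize_structure(data)` would list the keys of the first appraisal without needing to know them in advance.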


## Cell 4: Extract the Three Main Components

Each appraisal record in our JSON has three important parts:
1. **Subject** - The property we want to appraise
2. **Comps** - 3 properties that experts selected as similar
3. **Candidates** - All available properties we can choose from
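
To make the three parts concrete, here is a sketch of what a single appraisal record might look like and how each part is reached. Note that the candidates are stored under the `properties` key. The field names inside each property (`address`, `gla`) are invented for illustration; only the four top-level keys come from the structure printed above.

```python
# Made-up record: only the four top-level keys match the real dataset
record = {
    "orderID": "0001",
    "subject": {"address": "1 Example St", "gla": 1500},
    "comps": [
        {"address": "2 Example St", "gla": 1480},
        {"address": "5 Sample Ave", "gla": 1550},
        {"address": "9 Demo Rd", "gla": 1490},
    ],
    "properties": [
        {"address": "2 Example St", "gla": 1480},
        {"address": "3 Example St", "gla": 2100},
    ],
}

subject = record["subject"]        # the property being appraised
comps = record["comps"]            # the expert-selected comparables
candidates = record["properties"]  # the pool of candidate properties

print(f"Subject: {subject['address']}")
print(f"Comps: {len(comps)}, candidates: {len(candidates)}")
```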

In [25]:
def extract_data_components_and_save(appraisals):
    """
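    Flattens all subjects, comps, and candidates in the given appraisals into
    three lists of dictionaries, tagging each entry with its orderID.

    Note: despite the function name, nothing is written to disk here.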
    """
    subject_list = []
    comp_list = []
    candidate_list = []
    
    for record in appraisals:
        order_id = record.get('orderID', None)
        
        # Subject
        subject = record.get('subject', {}).copy()
        subject['orderID'] = order_id
        subject_list.append(subject)
        
        # Comps
        for comp in record.get('comps', []):
            comp_flat = comp.copy()
            comp_flat['orderID'] = order_id
            comp_list.append(comp_flat)
        
        # Candidates
        for candidate in record.get('properties', []):
            candidate_flat = candidate.copy()
            candidate_flat['orderID'] = order_id
            candidate_list.append(candidate_flat)
    
    print("✅ Data extraction complete!")
    print(f" - Subjects: {len(subject_list)}")
    print(f" - Comps: {len(comp_list)}")
    print(f" - Candidates: {len(candidate_list)}")

    return subject_list, comp_list, candidate_list

# Example usage:
subject_list, comp_list, candidate_list = extract_data_components_and_save(data['appraisals'])


✅ Data extraction complete!
 - Subjects: 88
 - Comps: 264
 - Candidates: 9820


## Cell 5: Convert Flat Lists to DataFrames

This function takes the **flat lists** of subjects, comps, and candidates and turns each one into a Pandas DataFrame.

**How it works:**
- Takes three lists (`subjects`, `comps`, `candidates`).
- Converts each list into a separate DataFrame for easier data analysis and manipulation.
- Prints out the number of rows in each DataFrame for a quick sanity check.

**Returns:**  
- `subjects_df` — DataFrame of all subject properties  
- `comps_df` — DataFrame of all comps  
- `candidates_df` — DataFrame of all candidates

In [32]:
def create_dataframes_all(subjects, comps, candidates):
    """
    Converts ALL subjects, comps, and candidates into DataFrames.
    """
    # Subjects
    subjects_df = pd.DataFrame(subjects)

    # Comps
    comps_df = pd.DataFrame(comps)

    # Candidates
    candidates_df = pd.DataFrame(candidates)

    print("✅ All DataFrames created!")
    print(f"   - Subjects:   {len(subjects_df)}")
    print(f"   - Comps:      {len(comps_df)}")
    print(f"   - Candidates: {len(candidates_df)}")
    return subjects_df, comps_df, candidates_df

# Example usage:
subjects_df, comps_df, candidates_df = create_dataframes_all(subject_list, comp_list, candidate_list)


✅ All DataFrames created!
   - Subjects:   88
   - Comps:      264
   - Candidates: 9820
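
The counts are consistent with three comps per appraisal (88 × 3 = 264). A per-order check is a slightly stronger sanity test than row totals. The sketch below uses a tiny made-up DataFrame and a hypothetical helper, `orders_with_unexpected_comp_count`; pass the real `comps_df` to apply it to the dataset.

```python
import pandas as pd


def orders_with_unexpected_comp_count(comps_df, expected=3):
    """Return comp counts, indexed by orderID, for orders whose count differs from `expected`."""
    counts = comps_df.groupby("orderID").size()
    return counts[counts != expected]


# Made-up data: order "A" has 3 comps, order "B" has only 2
toy_comps = pd.DataFrame({"orderID": ["A", "A", "A", "B", "B"]})
print(orders_with_unexpected_comp_count(toy_comps))
```

An empty result means every order that appears in the DataFrame has the expected number of comps. Orders with no comps at all never appear in `comps_df`, so they would need a separate comparison against the subjects.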


## Cell 6: Combine All Properties and Save to CSV

In [35]:
def combine_all_properties(subjects_df, comps_df, candidates_df):
    """
    Combines all properties into one DataFrame for easier processing
    """
    # Combine all DataFrames vertically (stack them)
    all_properties = pd.concat(
        [subjects_df, comps_df, candidates_df],
        ignore_index=True,  # Reset index to 0,1,2,3...
        sort=False,         # Keep the existing column order
    )
    
    print("✅ All properties combined!")
    print(f"   - Total properties: {len(all_properties)}")
    print(f"   - Total columns: {len(all_properties.columns)}")
    
    return all_properties

# Combine all properties
all_df = combine_all_properties(subjects_df, comps_df, candidates_df)

✅ All properties combined!
   - Total properties: 10172
   - Total columns: 69
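
Once the three DataFrames are stacked, nothing in `all_df` records whether a row was a subject, a comp, or a candidate. One option, sketched below with toy data, is to tag each frame before concatenating. The `role` column name and the `combine_with_role` helper are assumptions, not part of the original notebook, and this approach adds one column to the combined result.

```python
import pandas as pd


def combine_with_role(subjects_df, comps_df, candidates_df):
    """Stack the three DataFrames, adding a 'role' column that records each row's origin."""
    tagged = [
        subjects_df.assign(role="subject"),
        comps_df.assign(role="comp"),
        candidates_df.assign(role="candidate"),
    ]
    return pd.concat(tagged, ignore_index=True, sort=False)


# Toy frames with made-up columns
toy_subjects = pd.DataFrame({"orderID": ["A"], "address": ["1 Example St"]})
toy_comps = pd.DataFrame({"orderID": ["A"], "sale_price": [500000]})
toy_candidates = pd.DataFrame({"orderID": ["A", "A"], "close_price": [480000, 510000]})

combined = combine_with_role(toy_subjects, toy_comps, toy_candidates)
print(combined["role"].value_counts())
```

Columns that exist in only one of the inputs are filled with `NaN` for rows from the other inputs, which is how `pd.concat` behaves when the column sets differ.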


In [37]:
# index=False avoids writing the row index as an extra unnamed column
all_df.to_csv("hello.csv", index=False)
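
Saving each DataFrame separately as well can make later steps easier, since the three tables play different roles. A minimal sketch follows, assuming an output folder named `processed` and a hypothetical helper `save_frames`; neither appears in the original notebook.

```python
from pathlib import Path

import pandas as pd


def save_frames(frames, out_dir="processed"):
    """Write each DataFrame in `frames` (a name -> DataFrame dict) to <out_dir>/<name>.csv."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in frames.items():
        target = out_path / f"{name}.csv"
        # index=False keeps the row index out of the file
        frame.to_csv(target, index=False)
        written.append(target)
    return written


# Toy example; with the real data you might pass
# {"subjects": subjects_df, "comps": comps_df, "candidates": candidates_df}
paths = save_frames({"toy": pd.DataFrame({"orderID": ["A"], "gla": [1500]})})
print([str(p) for p in paths])
```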