# Real Estate Appraisal System - Data Processing Notebook
___

## Cell 1: Import Required Libraries


In [34]:
# Import the libraries we need
import json          # To read JSON files
import pandas as pd  # To work with data in table format
import numpy as np   # For numerical operations
import os

print("Libraries imported successfully!")

Libraries imported successfully!


## Cell 2: Load the JSON Data

First, let's create a function to load our JSON file. JSON (JavaScript Object Notation) is a common format for storing structured data.


In [35]:
def load_json_file(filename):
    """
    This function reads a JSON file and returns its contents
    
    Parameters:
    - filename: path to your JSON file (e.g., 'data.json')
    
    Returns:
    - data: the JSON content as a Python dictionary
    """
    # Open the file in read mode
    with open(filename, 'r') as file:
        # Load the JSON content
        data = json.load(file)
    
    print(f"✅ Successfully loaded {filename}")
    return data

# Load your JSON file (replace with your actual filename)
data = load_json_file('../data/appraisal/appraisals_dataset.json')

✅ Successfully loaded ../data/appraisal/appraisals_dataset.json
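
To make the returned structure concrete, here is a minimal, self-contained sketch that writes a tiny JSON file to a temporary folder and reads it back with the same `json.load` pattern. The sample record and the file name `sample.json` are made up for illustration; the real dataset has many more fields.

```python
import json
import os
import tempfile

# Hypothetical miniature of the dataset: one appraisal with made-up values
sample = {
    "appraisals": [
        {"orderID": "A1", "subject": {"gla": 1500}, "comps": [], "properties": []}
    ]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json")

    # Write the sample to disk as JSON text
    with open(sample_path, "w", encoding="utf-8") as f:
        json.dump(sample, f)

    # Read it back: JSON objects become dicts, JSON arrays become lists
    with open(sample_path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(type(loaded).__name__)                # type of the top-level object
print(type(loaded["appraisals"]).__name__)  # type of the nested array
```

Passing `encoding="utf-8"` explicitly is a defensive choice here, since the platform default encoding can differ between machines.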


## Cell 3: Understand the JSON Structure

In [36]:
def explore_json_structure(data):
    """
    This function shows us what's inside the JSON file
    """
    print("🔍 JSON Structure:")
    print(f"   - Main keys: {list(data.keys())}")
    
    # Check appraisals section
    if 'appraisals' in data:
        print(f"\n📊 Appraisals section:")
        print(f"   - Number of appraisals: {len(data['appraisals'])}")
        if len(data['appraisals']) > 0:
            print(f"   - Keys in first appraisal: {list(data['appraisals'][0].keys())}")
    
    # Check properties section
    if 'properties' in data:
        print(f"\n🏠 Properties section:")
        print(f"   - Number of candidate properties: {len(data['properties'])}")

# Run this to explore your data
explore_json_structure(data)

🔍 JSON Structure:
   - Main keys: ['appraisals']

📊 Appraisals section:
   - Number of appraisals: 88
   - Keys in first appraisal: ['orderID', 'subject', 'comps', 'properties']
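
The output above lists `properties` as a key inside each appraisal rather than at the top level, so the top-level `'properties'` check in `explore_json_structure` never fires for this file. A minimal sketch of counting the nested entries instead, using made-up sample data and a hypothetical helper named `count_nested`:

```python
def count_nested(appraisals, key):
    """Return a dict mapping each orderID to the number of items stored under `key`."""
    return {
        record.get("orderID"): len(record.get(key) or [])
        for record in appraisals
    }

# Made-up records that only mimic the key layout shown in the output above
sample_appraisals = [
    {"orderID": "A1", "subject": {}, "comps": [{}, {}, {}], "properties": [{}, {}]},
    {"orderID": "A2", "subject": {}, "comps": [{}, {}, {}], "properties": [{}]},
]

candidates_per_order = count_nested(sample_appraisals, "properties")
print(candidates_per_order)
print("Total candidates:", sum(candidates_per_order.values()))
```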


## Cell 4: Extract the Three Main Components

Each appraisal record has three important parts:
1. **Subject** (`subject`) - The property we want to appraise
2. **Comps** (`comps`) - The 3 properties that experts selected as similar
3. **Candidates** (`properties`) - All available properties we can choose from

In [37]:
def extract_data_components_and_save(appraisals):
    """
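    Flattens all subjects, comps, and candidates in the given appraisals into three
    lists of dicts, tagging each record with its orderID. Despite the function name,
    nothing is written to disk here; the CSV files are saved in Cell 8.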
    """
    subject_list = []
    comp_list = []
    candidate_list = []
    
    for record in appraisals:
        order_id = record.get('orderID', None)
        
        # Subject
        subject = record.get('subject', {}).copy()
        subject['orderID'] = order_id
        subject_list.append(subject)
        
        # Comps
        for comp in record.get('comps', []):
            comp_flat = comp.copy()
            comp_flat['orderID'] = order_id
            comp_list.append(comp_flat)
        
        # Candidates
        for candidate in record.get('properties', []):
            candidate_flat = candidate.copy()
            candidate_flat['orderID'] = order_id
            candidate_list.append(candidate_flat)
    
    print("✅ Data extraction complete!")
    print(f" - Subjects: {len(subject_list)}")
    print(f" - Comps: {len(comp_list)}")
    print(f" - Candidates: {len(candidate_list)}")
    
    return subject_list, comp_list, candidate_list

# Example usage:
subject_list, comp_list, candidate_list = extract_data_components_and_save(data['appraisals'])


✅ Data extraction complete!
 - Subjects: 88
 - Comps: 264
 - Candidates: 9820
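
For comparison, pandas can do a similar flattening with `pd.json_normalize`, which takes a `record_path` (the nested list to expand) and `meta` (parent fields to carry along on each row). This is a sketch on made-up records, not a replacement for the function above; the field names `address` and `sale_price` are invented for the example.

```python
import pandas as pd

# Made-up appraisals that follow the orderID / subject / comps / properties layout
sample_appraisals = [
    {"orderID": "A1", "subject": {"address": "1 Main St"},
     "comps": [{"address": "2 Main St", "sale_price": 400000}],
     "properties": [{"address": "2 Main St"}, {"address": "9 Oak Ave"}]},
    {"orderID": "A2", "subject": {"address": "5 Elm Rd"},
     "comps": [{"address": "7 Elm Rd", "sale_price": 350000}],
     "properties": [{"address": "7 Elm Rd"}]},
]

# One row per comp / candidate, with the parent orderID attached as a column
comps_flat = pd.json_normalize(sample_appraisals, record_path="comps", meta=["orderID"])
candidates_flat = pd.json_normalize(sample_appraisals, record_path="properties", meta=["orderID"])

print(comps_flat)
print(candidates_flat)
```

The explicit loop in Cell 4 is easier to step through for learning purposes; `json_normalize` is mainly a shorter way to get straight to a DataFrame.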


## Cell 5: Convert Flat Lists to DataFrames

This function takes the **flat lists** of subjects, comps, and candidates and turns each one into a Pandas DataFrame.

**How it works:**
- Takes three lists (`subjects`, `comps`, `candidates`).
- Converts each list into a separate DataFrame for easier data analysis and manipulation.
- Prints out the number of rows in each DataFrame for a quick sanity check.

**Returns:**  
- `subjects_df`: DataFrame of all subject properties
- `comps_df`: DataFrame of all comps
- `candidates_df`: DataFrame of all candidates


In [38]:
def create_dataframes_all(subjects, comps, candidates):
    """
    Converts ALL subjects, comps, and candidates into DataFrames.
    """
    # Subjects
    subjects_df = pd.DataFrame(subjects)

    # Comps
    comps_df = pd.DataFrame(comps)

    # Candidates
    candidates_df = pd.DataFrame(candidates)

    print("✅ All DataFrames created!")
    print(f"   - Subjects:   {len(subjects_df)}")
    print(f"   - Comps:      {len(comps_df)}")
    print(f"   - Candidates: {len(candidates_df)}")
    return subjects_df, comps_df, candidates_df

# Example usage:
subjects_df, comps_df, candidates_df = create_dataframes_all(subject_list, comp_list, candidate_list)


✅ All DataFrames created!
   - Subjects:   88
   - Comps:      264
   - Candidates: 9820
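
Total row counts do not show whether every appraisal kept its records. A sketch of a per-order check, using made-up DataFrames that only mimic the `orderID` column added in Cell 4:

```python
import pandas as pd

# Made-up stand-ins for subjects_df, comps_df, and candidates_df
subjects_demo = pd.DataFrame({"orderID": ["A1", "A2"]})
comps_demo = pd.DataFrame({"orderID": ["A1", "A1", "A1", "A2", "A2", "A2"]})
candidates_demo = pd.DataFrame({"orderID": ["A1", "A1", "A2"]})

# Count rows per appraisal in each table and line the counts up side by side
per_order = pd.DataFrame({
    "subjects": subjects_demo.groupby("orderID").size(),
    "comps": comps_demo.groupby("orderID").size(),
    "candidates": candidates_demo.groupby("orderID").size(),
}).fillna(0).astype(int)

print(per_order)
```

An order with zero candidates or a comp count other than 3 would stand out immediately in a table like this.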


## Cell 6: Quick DataFrame Exploration Utility

This function gives you a **quick overview** of any DataFrame, making it easy to check your data after loading or cleaning.

**What it does:**
- Prints the **shape** of your DataFrame (number of rows and columns).
- Shows how many records there are for each `orderID` (helpful for grouping).
- Lists all **column names** (in groups of 5 for readability).
- Displays the **data type** breakdown for your columns (e.g., int, float, object).
- Shows the **first 3 rows** so you can visually inspect your data.

**Returns:**  
- The first 3 rows of your DataFrame for a quick glance.

---

**Example usage:**
```python
explore_dataframe(subjects_df)
explore_dataframe(comps_df)
explore_dataframe(candidates_df)
```


In [39]:
def explore_dataframe(df):
    """
    Shows basic information about our DataFrame
    """
    print("📊 DATAFRAME OVERVIEW")
    print("=" * 50)
    
    # Basic info
    print(f"\n📏 Shape: {df.shape[0]} rows × {df.shape[1]} columns")
    
    # Number of records per appraisal order
    print(f"\n🏷️  Records per orderID:")
    print(df['orderID'].value_counts())
    
    # Column names
    print(f"\n📋 Columns ({len(df.columns)} total):")
    # Print columns in groups of 5 for readability
    cols = list(df.columns)
    for i in range(0, len(cols), 5):
        print(f"   {cols[i:i+5]}")
    
    # Data types
    print(f"\n🔤 Data Types Summary:")
    print(df.dtypes.value_counts())
    
    # Show first few rows
    print(f"\n👀 First 3 rows:")
    return df.head(3)

# Explore each DataFrame (uncomment the one you want to inspect)
#explore_dataframe(subjects_df)
#explore_dataframe(comps_df)
#explore_dataframe(candidates_df)

## Cell 7: Analyze Missing Values in DataFrames

This function quickly summarizes missing values for any DataFrame.

**What it does:**
- Calculates how many values are missing in each column and the percent missing.
- Only shows columns that actually have missing data, sorted by percent missing.
- Prints a quick summary, then returns a table of all columns with missing data.

---

**Example usage:**
```python
missing_summary_subject = analyze_missing_values(subjects_df)
missing_summary_comps = analyze_missing_values(comps_df)
missing_summary_candidates = analyze_missing_values(candidates_df)
```


In [40]:
def analyze_missing_values(df):
    """
    Finds and reports missing values in the DataFrame
    """
    # Calculate missing values
    missing_count = df.isnull().sum()
    missing_percent = (missing_count / len(df)) * 100
    
    # Create a summary DataFrame
    missing_df = pd.DataFrame({
        'Column': missing_count.index,
        'Missing_Count': missing_count.values,
        'Missing_Percent': np.round(missing_percent.values, 1)
    })
    
    # Filter to show only columns with missing values
    missing_df = missing_df[missing_df['Missing_Count'] > 0]
    missing_df = missing_df.sort_values('Missing_Percent', ascending=False)
    
    print("❓ MISSING VALUES ANALYSIS")
    print("=" * 50)
    print(f"Columns with missing values: {len(missing_df)} out of {len(df.columns)}")
    print(f"\nTop {len(missing_df)} columns with most missing values:")
    
    return missing_df.head(len(missing_df))




In [41]:
# Check missing values
missing_summary_subject = analyze_missing_values(subjects_df)
missing_summary_subject


❓ MISSING VALUES ANALYSIS
Columns with missing values: 25 out of 36

Top 25 columns with most missing values:


| | Column | Missing_Count | Missing_Percent |
|---|---|---|---|
| 3 | municipality_district | 1 | 1.1 |
| 20 | plumbing_lines | 1 | 1.1 |
| 30 | third_lvl_area | 1 | 1.1 |
| 29 | second_lvl_area | 1 | 1.1 |
| 28 | main_lvl_area | 1 | 1.1 |
| 27 | room_total | 1 | 1.1 |
| 26 | num_beds | 1 | 1.1 |
| 25 | room_count | 1 | 1.1 |
| 24 | cooling | 1 | 1.1 |
| 23 | water_heater | 1 | 1.1 |


In [42]:
missing_summary_comps = analyze_missing_values(comps_df)
missing_summary_comps

❓ MISSING VALUES ANALYSIS
Columns with missing values: 3 out of 20

Top 3 columns with most missing values:


| | Column | Missing_Count | Missing_Percent |
|---|---|---|---|
| 1 | prop_type | 3 | 1.1 |
| 7 | dom | 3 | 1.1 |
| 8 | location_similarity | 3 | 1.1 |


In [43]:
missing_summary_candidates = analyze_missing_values(candidates_df)
missing_summary_candidates

❓ MISSING VALUES ANALYSIS
Columns with missing values: 22 out of 29

Top 22 columns with most missing values:


| | Column | Missing_Count | Missing_Percent |
|---|---|---|---|
| 16 | bg_fin_area | 9820 | 100.0 |
| 15 | upper_lvl_fin_area | 8418 | 85.7 |
| 14 | main_level_finished_area | 7064 | 71.9 |
| 13 | half_baths | 6272 | 63.9 |
| 17 | lot_size_sf | 4876 | 49.7 |
| 18 | year_built | 4026 | 41.0 |
| 12 | full_baths | 3424 | 34.9 |
| 19 | roof | 623 | 6.3 |
| 10 | levels | 213 | 2.2 |
| 3 | gla | 176 | 1.8 |

## Cell 8: Save DataFrames to CSV Files

This function saves your subjects, comps, and candidates DataFrames as CSV files inside the `../data/raw` folder.

**What it does:**
- Creates the `../data/raw` folder if it does not already exist.
- Saves each DataFrame (`subjects_df`, `comps_df`, `candidates_df`) to its own CSV file.
- Prints a confirmation with file names and row counts, plus a reminder that you can open the CSVs in Excel.

---

**Example usage:**
```python
save_data_to_csv(subjects_df, comps_df, candidates_df)
```


In [44]:
def save_data_to_csv(subjects_df, comps_df, candidates_df):
    """
    Saves our DataFrames to CSV files
    """
    out_dir = '../data/raw'
    
    # Create the output folder if it does not exist yet
    os.makedirs(out_dir, exist_ok=True)
    
    # Save individual DataFrames (index=False keeps the row index out of the files)
    subjects_df.to_csv(os.path.join(out_dir, 'subjects_raw.csv'), index=False)
    comps_df.to_csv(os.path.join(out_dir, 'comps_raw.csv'), index=False)
    candidates_df.to_csv(os.path.join(out_dir, 'candidates_raw.csv'), index=False)
    
    # Report the actual row counts instead of hardcoded numbers
    print(f"💾 Data saved to CSV files in '{out_dir}':")
    print(f"   - subjects_raw.csv ({len(subjects_df)} subjects)")
    print(f"   - comps_raw.csv ({len(comps_df)} comps)")
    print(f"   - candidates_raw.csv ({len(candidates_df)} candidates)")
    print("\n📌 You can open these in Excel to explore!")

# Save the data
save_data_to_csv(subjects_df, comps_df, candidates_df)

💾 Data saved to CSV files in '../data/raw':
   - subjects_raw.csv (88 subjects)
   - comps_raw.csv (264 comps)
   - candidates_raw.csv (9820 candidates)

📌 You can open these in Excel to explore!
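
After saving, a quick round-trip check can confirm that a file reads back with the same shape and columns. This sketch writes a made-up DataFrame to a temporary folder rather than `../data/raw`, so it can run anywhere:

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for subjects_df
demo = pd.DataFrame({"orderID": ["A1", "A2"], "gla": [1500, 1800]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "subjects_raw.csv")
    demo.to_csv(csv_path, index=False)  # index=False keeps the row index out of the file
    reloaded = pd.read_csv(csv_path)

print(reloaded.shape == demo.shape)
print(list(reloaded.columns) == list(demo.columns))
```

Note that CSV does not store data types, so columns are re-inferred on load; comparing shapes and column names is a lighter check than requiring the two DataFrames to be identical.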
