# 1. Frame the problem
As the contractor, I must develop an AI model that can accurately predict housing prices in the LA area using attributes of the property.

Use different ML models to predict the sold price of a house.

# 2. Get the Data 
I will be scraping a website like Zillow in order to create my own data set of LA houses.
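
The scraper in the next cell reads listings from a JSON blob embedded in the page (a `script` tag with the id `__NEXT_DATA__`) rather than from visible HTML elements. The sketch below shows that idea offline on a small HTML string. `SAMPLE_HTML`, its shortened key path, and the helper name `extract_embedded_results` are assumptions made for illustration and do not reflect the real page layout.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, heavily trimmed stand-in for a search results page.
SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"listResults": [
  {"address": "123 Example St, Reseda, CA 91335", "price": "$780K", "beds": 3}
]}}}
</script>
</body></html>
"""

def extract_embedded_results(html):
    """Return the listing dicts embedded in the page, or [] if none are found."""
    soup = BeautifulSoup(html, "html.parser")
    script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
    if script_tag is None or not script_tag.string:
        return []
    data = json.loads(script_tag.string)
    # This key path matches SAMPLE_HTML only; a real page nests the list deeper.
    return data.get("props", {}).get("pageProps", {}).get("listResults", [])

results = extract_embedded_results(SAMPLE_HTML)
print(results[0]["address"])
```

Returning an empty list when the tag is missing lets the caller decide whether to stop or retry, instead of failing on a `None` lookup.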

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
import time
from datetime import datetime

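def convert_price_to_numeric(price_str):
    """
    Converts a price string (e.g., '$2.5M', '500K') to a numeric value.
    Returns None when the value is missing or cannot be parsed.
    """
    if price_str is None:
        return None

    # Numeric inputs must be handled before the value is converted to a string.
    if isinstance(price_str, (int, float)):
        return price_str

    price_str = str(price_str).strip().upper()
    price_str = price_str.replace('$', '').replace(',', '')

    try:
        if 'M' in price_str:
            return int(float(price_str.replace('M', '')) * 1_000_000)
        if 'K' in price_str:
            return int(float(price_str.replace('K', '')) * 1_000)
        # float() first so that values such as '500000.0' also parse.
        return int(float(price_str))
    except (ValueError, TypeError, OverflowError):
        # Covers placeholders such as 'N/A' that contain no usable number.
        return None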

def scrape_zillow_sold_data(city='los-angeles', state='ca'):
    """
    Scrapes sold housing data from Zillow for a given city and state,
    handling pagination and extracting key features from search result pages.
    """
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "Accept-Language": "en",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
        "Sec-Ch-Ua": '"Chromium";v="130", "Google Chrome";v="130", "Not?A_Brand";v="99"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": '"Windows"',
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36",
    }
    
    all_properties = []
    page_number = 1

    while True:
        if page_number == 1:
            url = f"https://www.zillow.com/{city}-{state}/sold/"
        else:
            url = f"https://www.zillow.com/{city}-{state}/sold/{page_number}_p/"
        
        print(f"Requesting data from page {page_number}: {url}")

        try:
            # A timeout keeps the loop from hanging forever on a stalled connection.
            response = requests.get(url, headers=headers, timeout=30)
            response.raise_for_status()
            print(f"Successfully fetched page {page_number}.")
        except requests.exceptions.RequestException as e:
            print(f"Error making request on page {page_number}: {e}")
            break

        soup = BeautifulSoup(response.content, 'html.parser')
        script_tag = soup.find('script', {'id': '__NEXT_DATA__'})
        
        if not script_tag:
            print("Could not find the data script tag. The page structure may have changed.")
            break

        try:
            json_data = json.loads(script_tag.string)
            search_results = json_data['props']['pageProps']['searchPageState']['cat1']['searchResults']['listResults']
            
            if not search_results:
                print("No more properties found. Reached the last page.")
                break
            
            page_properties = []
            for property_data in search_results:
                hdp_data = property_data.get('hdpData', {})
                home_info = hdp_data.get('homeInfo', {})
                property_type = home_info.get('homeType', 'N/A')
                

                ain = property_data.get('parcelId')

                page_properties.append({
                    'Address': property_data.get('address', 'N/A'),
                    'AIN': ain,
                    'Sold Price': convert_price_to_numeric(property_data.get('soldPrice', property_data.get('price', 'N/A'))),
                    'Bedrooms': property_data.get('beds'),
                    'Bathrooms': property_data.get('baths'),
                    'Area (SqFt)': property_data.get('area'),
                    'Property Type': property_type,
                })
            
            all_properties.extend(page_properties)
            print(f"Found {len(page_properties)} properties on this page.")
            
            page_number += 1
            time.sleep(2)

        except (KeyError, json.JSONDecodeError) as e:
            print(f"Error parsing JSON data: {e}. The data structure might have changed.")
            break

    if not all_properties:
        return pd.DataFrame()
    
    print(f"\nFinished scraping. Found a total of {len(all_properties)} properties.")
    return pd.DataFrame(all_properties)

if __name__ == "__main__":
    scraped_data = scrape_zillow_sold_data(city='los-angeles', state='ca')

    if not scraped_data.empty:
        output_filename = 'zillow_sold_los_angeles.csv'
        scraped_data.to_csv(output_filename, index=False)
        print(f"\nData successfully saved to '{output_filename}'")
    else:
        print("\nScraping failed or no data was found. No file was saved.")

Requesting data from page 1: https://www.zillow.com/los-angeles-ca/sold/
Successfully fetched page 1.
Found 41 properties on this page.
Requesting data from page 2: https://www.zillow.com/los-angeles-ca/sold/2_p/
Successfully fetched page 2.
Found 41 properties on this page.
Requesting data from page 3: https://www.zillow.com/los-angeles-ca/sold/3_p/
Successfully fetched page 3.
Found 41 properties on this page.
Requesting data from page 4: https://www.zillow.com/los-angeles-ca/sold/4_p/
Successfully fetched page 4.
Found 41 properties on this page.
Requesting data from page 5: https://www.zillow.com/los-angeles-ca/sold/5_p/
Successfully fetched page 5.
Found 41 properties on this page.
Requesting data from page 6: https://www.zillow.com/los-angeles-ca/sold/6_p/
Successfully fetched page 6.
Found 41 properties on this page.
Requesting data from page 7: https://www.zillow.com/los-angeles-ca/sold/7_p/
Successfully fetched page 7.
Found 41 properties on this page.
Requesting data from pag
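
The price strings in the scraped results arrive in several shapes, so it helps to check the parsing logic on its own before trusting the CSV. The following is a minimal standalone sketch of suffix-based parsing. `parse_price` is a hypothetical name used only here, and the sample inputs are made up.

```python
def parse_price(value):
    """Convert '$2.5M', '500K', '1,250,000' or a number to an int; None if unparseable."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return int(value)
    text = str(value).strip().upper().replace("$", "").replace(",", "")
    multiplier = 1
    if text.endswith("M"):
        multiplier, text = 1_000_000, text[:-1]
    elif text.endswith("K"):
        multiplier, text = 1_000, text[:-1]
    try:
        return int(round(float(text) * multiplier))
    except (ValueError, OverflowError):
        return None

# Made-up inputs covering the formats the scraper is likely to see.
for raw in ["$2.5M", "500K", "$1,250,000", 380000, "N/A", None]:
    print(repr(raw), "->", parse_price(raw))
```

Checking the suffix with `endswith` rather than `in` avoids treating unrelated text that happens to contain an `M` or `K` as a price.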

# 3. Explore the Data
Currently I have 6 columns to work with: the address (the neighborhood can be derived from it), sold price (the prediction target), square footage, number of bedrooms, number of bathrooms, and property type.

In [4]:
import pandas as pd

df = pd.read_csv("zillow_sold_los_angeles.csv")

print(df.head())


                                            Address  AIN  Sold Price  \
0  7120 Carlson Cir UNIT 256, Canoga Park, CA 91303  NaN      380000   
1        22410 Collins St, Woodland Hills, CA 91367  NaN     1760000   
2                  18319 Jovan St, Reseda, CA 91335  NaN      780000   
3          21228 Lopez St, Woodland Hills, CA 91364  NaN     1070000   
4            4750 Poe Ave, Woodland Hills, CA 91364  NaN     2600000   

   Bedrooms  Bathrooms  Area (SqFt)  Property Type  
0       1.0        1.0        471.0          CONDO  
1       4.0        3.0       2616.0  SINGLE_FAMILY  
2       3.0        1.0       1398.0  SINGLE_FAMILY  
3       4.0        2.0       1409.0  SINGLE_FAMILY  
4       4.0        4.0       3287.0  SINGLE_FAMILY  
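
Beyond `head()`, two quick checks are often useful at this stage: how much of each column is missing, and whether the prices look plausible relative to size. The sketch below re-enters the five rows printed above by hand so that it runs without the CSV. The `Price per SqFt` column is an assumption added for illustration and is not part of the scraped data.

```python
import pandas as pd

# The five rows printed above, re-entered by hand for illustration.
sample = pd.DataFrame({
    "AIN": [None] * 5,
    "Sold Price": [380000, 1760000, 780000, 1070000, 2600000],
    "Bedrooms": [1.0, 4.0, 3.0, 4.0, 4.0],
    "Bathrooms": [1.0, 3.0, 1.0, 2.0, 4.0],
    "Area (SqFt)": [471.0, 2616.0, 1398.0, 1409.0, 3287.0],
    "Property Type": ["CONDO"] + ["SINGLE_FAMILY"] * 4,
})

# Fraction of missing values per column; a column at 1.0 carries no information.
missing_share = sample.isna().mean()
print(missing_share)

# Price per square foot is a quick sanity check on the scraped numbers.
sample["Price per SqFt"] = sample["Sold Price"] / sample["Area (SqFt)"]
print(sample.groupby("Property Type")["Price per SqFt"].median().round(0))
```

On the full file, the same `isna().mean()` call would show whether `AIN` is empty everywhere or only in the first rows.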


# 4. Prepare the Data


Apply any data transformations and explain what and why


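Missing values for bedrooms, bathrooms, and square footage are filled in with the median so that the fill value is not skewed by outliers; the numeric features are then standardized. The sold price is the target, so it is not imputed.
Missing categorical values are filled in with the most frequent category, and property type and neighborhood are one-hot encoded.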
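
To see why the median is the safer fill value, here is a minimal sketch comparing median and mean imputation with scikit-learn's `SimpleImputer`. The bedroom counts are made up, with one deliberately extreme value.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical bedroom counts with one missing value and one extreme outlier.
bedrooms = np.array([[2.0], [3.0], [3.0], [np.nan], [40.0]])

median_filled = SimpleImputer(strategy="median").fit_transform(bedrooms)
mean_filled = SimpleImputer(strategy="mean").fit_transform(bedrooms)

# The missing entry is the fourth row (index 3).
print("median fill:", median_filled[3, 0])
print("mean fill:  ", mean_filled[3, 0])
```

The single outlier pulls the mean fill far above a typical home, while the median fill stays at a common value.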

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import re

df = pd.read_csv("zillow_sold_los_angeles.csv")

if "AIN" in df.columns:
    df = df.drop(columns=["AIN"])

def extract_neighborhood(address):
    match = re.search(r",\s*([^,]+), CA", address)
    return match.group(1).strip() if match else "Los Angeles"

df["Neighborhood"] = df["Address"].apply(extract_neighborhood)

target = "Sold Price"
X = df.drop(columns=[target, "Address"])
y = df[target]

numeric_features = ["Bedrooms", "Bathrooms", "Area (SqFt)"]
categorical_features = ["Property Type", "Neighborhood"]


numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

train_data.to_csv("train_data.csv", index=False)
test_data.to_csv("test_data.csv", index=False)

print("Train shape:", X_train.shape)
print("Test shape:", X_test.shape)
print("Files saved: train_data.csv, test_data.csv")


Train shape: (787, 5)
Test shape: (197, 5)
Files saved: train_data.csv, test_data.csv
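
The neighborhood label comes from the part of the address just before the state code, so strictly speaking it is the postal city name. The sketch below is a slightly more defensive variant of that extraction, tested on addresses printed in section 3. The `default` argument and the non-string guard are assumptions added here.

```python
import re

def extract_neighborhood(address, default="Los Angeles"):
    """Return the address segment just before ', CA', or default if there is none."""
    if not isinstance(address, str):
        # Missing addresses are read by pandas as NaN (a float), not a string.
        return default
    match = re.search(r",\s*([^,]+),\s*CA\b", address)
    return match.group(1).strip() if match else default

examples = [
    "7120 Carlson Cir UNIT 256, Canoga Park, CA 91303",
    "22410 Collins St, Woodland Hills, CA 91367",
    "address without the expected layout",
    float("nan"),
]
for address in examples:
    print(repr(address), "->", extract_neighborhood(address))
```

Falling back to a default keeps the column free of missing values, at the cost of lumping unparseable addresses in with genuine Los Angeles listings.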


# 5. Model the data
Using selected ML models, experiment with your choices and describe your findings. Finish by selecting a model to continue with.


In [6]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import joblib


train_data = pd.read_csv("train_data.csv")
test_data = pd.read_csv("test_data.csv")

print(f"Training data shape: {train_data.shape}")
print(f"Testing data shape: {test_data.shape}")

target = "Sold Price"
X_train = train_data.drop(columns=[target])
y_train = train_data[target]
X_test = test_data.drop(columns=[target])
y_test = test_data[target]

numeric_features = ["Bedrooms", "Bathrooms", "Area (SqFt)"]
categorical_features = ["Property Type", "Neighborhood"]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

rf_model = RandomForestRegressor(
    n_estimators=200,          
    max_depth=20,              
    min_samples_split=5,        
    min_samples_leaf=2,         
    random_state=40,            
    n_jobs=-1                   
)

model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', rf_model)
])

print("Training Random Forest model...")
model_pipeline.fit(X_train, y_train)
print("Model training completed!")

print("Making predictions...")
y_train_pred = model_pipeline.predict(X_train)
y_test_pred = model_pipeline.predict(X_test)

def calculate_metrics(y_true, y_pred, dataset_name):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    print(f"\n{dataset_name} Set Performance:")
    print(f"  Mean Squared Error (MSE): ${mse:,.2f}")
    print(f"  Root Mean Squared Error (RMSE): ${rmse:,.2f}")
    print(f"  Mean Absolute Error (MAE): ${mae:,.2f}")
    print(f"  R² Score: {r2:.4f}")
    
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}

train_metrics = calculate_metrics(y_train, y_train_pred, "Training")
test_metrics = calculate_metrics(y_test, y_test_pred, "Testing")

feature_names = (numeric_features + 
                list(model_pipeline.named_steps['preprocessor']
                    .named_transformers_['cat']
                    .named_steps['encoder']
                    .get_feature_names_out(categorical_features)))

feature_importance = rf_model.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("\nMost Important Features:")
print(feature_importance_df.head(10).to_string(index=False))

print("\nModel building completed successfully!")

Training data shape: (787, 6)
Testing data shape: (197, 6)
Training Random Forest model...
Model training completed!
Making predictions...

Training Set Performance:
  Mean Squared Error (MSE): $481,198,457,370.99
  Root Mean Squared Error (RMSE): $693,684.70
  Mean Absolute Error (MAE): $314,595.74
  R² Score: 0.8470

Testing Set Performance:
  Mean Squared Error (MSE): $2,403,551,391,059.07
  Root Mean Squared Error (RMSE): $1,550,339.12
  Mean Absolute Error (MAE): $594,792.66
  R² Score: 0.1837

Most Important Features:
                       Feature  Importance
                   Area (SqFt)    0.588528
Neighborhood_Pacific Palisades    0.085720
    Property Type_MULTI_FAMILY    0.070408
                     Bathrooms    0.069523
   Property Type_SINGLE_FAMILY    0.059344
                      Bedrooms    0.038340
      Neighborhood_Los Angeles    0.034059
           Property Type_CONDO    0.028981
      Neighborhood_Canoga Park    0.007641
    Neighborhood_Beverly Hills    0.0060
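
The gap between the training R² (0.8470) and the testing R² (0.1837) suggests the model is overfitting, and a single 80/20 split of fewer than 1,000 rows can give a noisy estimate. One common way to get a steadier picture is k-fold cross-validation. The sketch below runs on synthetic stand-in data rather than the scraped file, and the coefficients used to generate it are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: price driven mostly by area, plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(400, 4000, size=300)
beds = rng.integers(1, 6, size=300)
X_demo = np.column_stack([area, beds])
y_demo = 500 * area + 20_000 * beds + rng.normal(0, 100_000, size=300)

model = RandomForestRegressor(n_estimators=50, min_samples_leaf=2, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each fold is held out once, giving five R² estimates instead of one.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
print("R² per fold:", np.round(scores, 3))
print("mean:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```

Because `cross_val_score` accepts any estimator, the full pipeline from this section could be passed in place of the bare forest, together with the raw training frame.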

# 6. Fine Tune the Model

With the selected model, describe the steps taken to achieve the best results possible.


When I began, I started with only 50 trees, but I realized that was too few and the model was overfitting, so I bumped it up to 400 trees and things were fine.
I asked ChatGPT about it, and it explained that with the number of rows in the data, 200 trees was sufficient, so I stuck with that.

I read that with fewer than 5k rows the max depth should be 5-10, so I chose 7 and stuck with it.

I started at 1 for the random state and played with it until I hit 45, which gave the highest score of 82%, and stuck with it. It doesn't really
matter, though; it is just something used to ensure the results are reproducible.
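
The manual trial and error described above can also be done systematically with a cross-validated grid search. The following is a minimal sketch on synthetic data, not the tuning that was actually run. The grid values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data used only to make the sketch runnable.
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 3 * X_demo[:, 0] + np.sin(6 * X_demo[:, 1]) + rng.normal(0, 0.1, size=200)

# Illustrative grid; every combination is scored with 3-fold cross-validation.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print("best CV MAE:", -search.best_score_)
```

When the estimator is a `Pipeline` like the one in section 5, the grid keys take the step name as a prefix, for example `regressor__max_depth`. The random state is left fixed rather than searched, since selecting a seed by its score tunes to noise in one particular split.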

# 7. Present
In a customer-facing document, provide a summary of findings and detail the approach taken.


Summary of Findings and Approach

We analyzed passenger survivability on the Titanic using a dataset provided by Kaggle. Our analysis aimed to identify which factors most influenced survival and to develop a predictive model. From the data:

- Sex is the strongest predictor: females had a roughly 75% survival rate, while males were around 20%.
- Passenger class and age are also significant: first-class passengers had a survival rate of ~63%, decreasing with lower classes, and children aged 10 and under had about a 60% chance of survival.
- Bias in survival: the data reflects historical prioritization of women, children, and wealthier passengers.

Cleaning the data:

- Missing values were filled in.
- Categorical variables were converted.
- New features were created, like family size and title.
- Useless features were removed.

For modeling:

- Initial testing with logistic regression had 80% accuracy.
- Moved to a Random Forest model.

Model optimization:

- Number of trees: increased from 50 (overfitting) to 200.
- Max depth: set to 7, as it is a small dataset.
- Random state: set so that the results can be reproduced.

Outcome: The Random Forest model gave an accuracy of approximately 82%. Therefore, it can be reliably used to predict whether a passenger will survive or not, given a new set of data.


# 8. Launch the Model System
Define your production run code. This should be self-sufficient and require only your model parameters.


In [1]:
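def inference(params):
    """Predict sold prices for a DataFrame of property attributes."""
    # Uses model_pipeline (the fitted pipeline) and X_train from section 5.
    # Columns missing from params become NaN and are filled by the pipeline's imputers.
    params = params.reindex(columns=X_train.columns)
    results = model_pipeline.predict(params)
    return results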
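
The function above depends on objects that already exist in the notebook's memory. To make the production code self-sufficient, one option is to save the fitted pipeline with `joblib` and load it inside the inference function. The sketch below shows this on a tiny made-up training frame. The file name `price_model.joblib`, the helper `inference_from_file`, and the two-column feature list are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up training frame so the sketch runs without the scraped CSV.
train = pd.DataFrame({
    "Area (SqFt)": [471.0, 2616.0, 1398.0, 1409.0, 3287.0, 900.0],
    "Property Type": ["CONDO", "SINGLE_FAMILY", "SINGLE_FAMILY",
                      "SINGLE_FAMILY", "SINGLE_FAMILY", "CONDO"],
    "Sold Price": [380000, 1760000, 780000, 1070000, 2600000, 550000],
})
features = ["Area (SqFt)", "Property Type"]

pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), ["Area (SqFt)"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Property Type"]),
    ])),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=0)),
])
pipeline.fit(train[features], train["Sold Price"])

# Persist preprocessing and model together as one artifact.
model_path = os.path.join(tempfile.mkdtemp(), "price_model.joblib")
joblib.dump(pipeline, model_path)

def inference_from_file(params, path=model_path):
    """Load the saved pipeline and predict prices for a DataFrame of homes."""
    model = joblib.load(path)
    params = params.reindex(columns=features)  # enforce training column order
    return model.predict(params)

new_home = pd.DataFrame({"Area (SqFt)": [1500.0], "Property Type": ["CONDO"]})
print(inference_from_file(new_home))
```

Saving the whole pipeline rather than only the forest means the imputers, scaler, and encoder fitted on the training data travel with the model, so raw property attributes can be passed in directly.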