# Redfin Listing Price Predictor

## Instructions

Go to Redfin.com and find the listing whose sale price you want to predict, then:

1. Open the Cell menu in the toolbar.
2. Select Run All.
3. Enter the house information when prompted.
4. Copy and paste the house description.
5. Enter the house listing price.
6. Scroll to the bottom to see the prediction.

- PROPERTY TYPE input options (case sensitive):
    - Single Family Residential
    - Condo/Co-op
    - Townhouse
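
Because the property type is matched case-sensitively, a small check before prediction can catch typos early. The sketch below is one possible guard; `VALID_PROPERTY_TYPES` and `check_property_type` are hypothetical names that are not part of the notebook.

```python
# Hypothetical helper, not part of the original notebook.
VALID_PROPERTY_TYPES = (
    "Single Family Residential",
    "Condo/Co-op",
    "Townhouse",
)


def check_property_type(value: str) -> str:
    """Return the stripped value if it is an accepted option, else raise ValueError."""
    cleaned = value.strip()  # tolerate stray spaces or tabs around the input
    if cleaned not in VALID_PROPERTY_TYPES:
        options = ", ".join(VALID_PROPERTY_TYPES)
        raise ValueError(f"Unknown property type {value!r}; expected one of: {options}")
    return cleaned


print(check_property_type("Townhouse "))
```

A lowercase entry such as `townhouse` is rejected here, which matches the case-sensitivity warning above.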

In [164]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import spacy 
import re
from gensim.models.phrases import Phraser, Phrases
import pickle
from datetime import datetime
import time

In [165]:
current_year = datetime.now().year
current_month = datetime.now().month
current_quarter = (current_month - 1) // 3 + 1  # Calculate the quarter based on the current month
current_date = datetime.now().date()

columns = [
    "Year",
    "Quarter",
    "YEAR BUILT",
    "SALE TYPE",
    "SOLD DATE",
    "PROPERTY TYPE",
    "ZIP OR POSTAL CODE",
    "BEDS",
    "BATHS",
    "LOCATION",
    "SQUARE FEET",
    "LOT SIZE", 
    "Description"
]

df = pd.DataFrame(columns=columns)
for column in columns:
    if column == "Year":
        user_input = current_year
    elif column == "Quarter":
        user_input = current_quarter
    elif column == "SOLD DATE":
        user_input = current_date 
    elif column == "SALE TYPE":
        user_input = "PAST SALE"  
    else:
        user_input = input(f"Enter {column}: ")
    df.at[0, column] = user_input
price = input("Enter Price: ")

Enter YEAR BUILT: 2008
Enter PROPERTY TYPE: Single Family Residential
Enter ZIP OR POSTAL CODE: 98136
Enter BEDS: 4
Enter BATHS: 3.5
Enter LOCATION: Seaview
Enter SQUARE FEET: 4040
Enter LOT SIZE: 5000
Enter Description: This contemporary west facing home in the coveted neighborhood of Seaview offers over 4000 sq ft, tons of natural light t/o, 3 bedrooms on one level, spacious primary bedroom w/ 5 piece bathroom & so much more. Main level features high ceilings, bamboo floors, formal dining room, living room, 2 gas fireplaces, family room, eat-in chef's kitchen w/ SS high end appliances. The basement is a dream w/bar, 9 ft ceilings, plenty of room for ping pong or pool table, large TV, ample storage, 4th bedroom, 3/4 bath. Home boasts low maintenance yard, private backyard w/ plenty of room for BBQ & pizza oven, basketball court which leads to 2 car garage +2 car parking pad off of paved alley. Close to all that West Seattle has to offer!
Enter Price: 1799950
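
The Year, Quarter, and SOLD DATE fields are taken from the system clock rather than typed in. As a sketch, the quarter arithmetic can be pulled into a standalone function that is easy to check on fixed dates; `quarter_of` is a hypothetical name, not something the notebook defines.

```python
from datetime import date


def quarter_of(day: date) -> int:
    """Map a date to its calendar quarter (1-4) using the same formula as the notebook."""
    return (day.month - 1) // 3 + 1


for sample in (date(2023, 1, 31), date(2023, 6, 30), date(2023, 11, 15)):
    print(sample.isoformat(), "-> Q", quarter_of(sample))
```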


In [166]:
# Age of the house in years, relative to the same year stored in the Year column
df["Age"] = current_year - df["YEAR BUILT"].astype("int")
# input() returns strings, so cast each numeric field explicitly
df["Year"] = df["Year"].astype("int")
df["Quarter"] = df["Quarter"].astype("int")
df["ZIP OR POSTAL CODE"] = df["ZIP OR POSTAL CODE"].astype("int")
df["BEDS"] = df["BEDS"].astype("float")
df["BATHS"] = df["BATHS"].astype("float")
df["SQUARE FEET"] = df["SQUARE FEET"].astype("int")
df["LOT SIZE"] = df["LOT SIZE"].astype("int")
df["SOLD DATE"] = pd.to_datetime(df["SOLD DATE"])
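
The casts above raise a bare `ValueError` if a field was mistyped (for example `3,5` for baths). One option is to validate the raw strings before they reach the DataFrame. The sketch below assumes a hypothetical `CASTS` table and `parse_fields` helper, neither of which exists in the notebook.

```python
# Hypothetical converter table mirroring the dtypes used in the notebook.
CASTS = {
    "YEAR BUILT": int,
    "ZIP OR POSTAL CODE": int,
    "BEDS": float,
    "BATHS": float,
    "SQUARE FEET": int,
    "LOT SIZE": int,
}


def parse_fields(raw: dict) -> dict:
    """Convert raw input strings to numbers, naming the offending field on failure."""
    parsed = dict(raw)  # fields without a converter (e.g. LOCATION) pass through
    for field, cast in CASTS.items():
        if field not in raw:
            continue
        try:
            parsed[field] = cast(raw[field].strip())
        except ValueError as exc:
            raise ValueError(
                f"{field}: cannot read {raw[field]!r} as {cast.__name__}"
            ) from exc
    return parsed


print(parse_fields({"YEAR BUILT": "2008", "BATHS": "3.5", "LOCATION": "Seaview"}))
```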

In [167]:
# Display only: the result is not assigned, so df itself still contains YEAR BUILT
df.drop(columns="YEAR BUILT")

| | Year | Quarter | SALE TYPE | SOLD DATE | PROPERTY TYPE | ZIP OR POSTAL CODE | BEDS | BATHS | LOCATION | SQUARE FEET | LOT SIZE | Description | Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2023 | 4 | PAST SALE | 2023-11-15 | Single Family Residential | 98136 | 4.0 | 3.5 | Seaview | 4040 | 5000 | This contemporary west facing home in the cove... | 15 |
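
`DataFrame.drop` returns a new frame and leaves the original untouched unless the result is assigned. A short demonstration on a toy frame (the values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({"YEAR BUILT": [2008], "Age": [15]})

preview = toy.drop(columns="YEAR BUILT")  # new frame without the column
print(list(preview.columns))  # ['Age']
print(list(toy.columns))      # ['YEAR BUILT', 'Age'] (original unchanged)
```

The notebook does not say whether the saved pipeline expects `YEAR BUILT` as an input column, so it is worth checking that before assigning the result back to `df`.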


In [168]:
df["Description"] = df["Description"].str.lower()

In [169]:
nlp = spacy.load('en_core_web_sm')


def lemmatize(text):
    """Return the text with every token replaced by its spaCy lemma."""
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [170]:
df["Description"] = df["Description"].apply(lemmatize)

In [171]:
stop_words = set(stopwords.words('english'))

In [172]:
df["Description"] = df["Description"].apply(word_tokenize)

In [173]:
clean_words = []
for tokenized_description in df["Description"]:
    cleaned_tokens = [token for token in tokenized_description if token not in stop_words]
    clean_words.append(cleaned_tokens)

In [174]:
df["Description"] = clean_words
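
The stop-word step boils down to a membership test per token. A minimal sketch with a tiny hand-written stop-word set (an assumption for illustration; the notebook uses NLTK's much larger English list) and a hypothetical `drop_stop_words` helper:

```python
# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
SAMPLE_STOP_WORDS = {"the", "in", "of", "be", "a", "to"}


def drop_stop_words(tokens, stop_words=SAMPLE_STOP_WORDS):
    """Keep only tokens that are not in the stop-word set, preserving order."""
    return [token for token in tokens if token not in stop_words]


tokens = ["the", "basement", "be", "a", "dream"]
print(drop_stop_words(tokens))
```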

In [175]:
def clean_tokens(tokens):
    """Strip non-alphanumeric characters from each token; drop tokens left empty."""
    cleaned_tokens = []
    for token in tokens:
        cleaned_token = re.sub(r'[^a-zA-Z0-9]', '', token)
        if cleaned_token:
            cleaned_tokens.append(cleaned_token)
    return cleaned_tokens
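
To see what this cleaning step does to typical listing text, here is a compact equivalent run on a few sample tokens. `strip_non_alphanumeric` is a hypothetical name; it applies the same regular expression as `clean_tokens`.

```python
import re

NON_ALNUM = re.compile(r"[^a-zA-Z0-9]")


def strip_non_alphanumeric(tokens):
    """Remove non-alphanumeric characters from each token and drop tokens left empty."""
    stripped = (NON_ALNUM.sub("", token) for token in tokens)
    return [token for token in stripped if token]


print(strip_non_alphanumeric(["eat-in", "chef", "'s", "kitchen", "w/", "&", "4000"]))
```

Note that hyphenated words are fused (`eat-in` becomes `eatin`) and pure punctuation tokens such as `&` disappear entirely.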



In [176]:
df["Description"] = df["Description"].apply(clean_tokens)

In [177]:
with open('Pickled Models/bigram_model.pkl', 'rb') as f:
    bigram = pickle.load(f)
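
The notebook loads three pickled objects from `Pickled Models/`. Since `pickle.load` can execute arbitrary code while unpickling, these files should only come from a trusted source. The sketch below wraps loading in a hypothetical `load_pickle` helper that fails with a clearer message when a file is missing; the round trip uses a stand-in dictionary, not a real model.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load a pickled object, failing with a clear message if the file is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Expected a pickled model at {path}")
    with path.open("rb") as handle:
        return pickle.load(handle)  # only unpickle files you trust


# Round-trip demo with a stand-in object instead of a real model.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.pkl"
    demo_path.write_bytes(pickle.dumps({"kind": "stand-in"}))
    restored = load_pickle(demo_path)

print(restored)
```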

In [178]:
df['Description'] = df['Description'].apply(lambda tokens: ' '.join(bigram[tokens]))
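
The pickled `bigram` object is presumably a gensim phrase model (the notebook imports `Phraser` and `Phrases`), which joins token pairs learned during training into single tokens. The pure-Python sketch below imitates that behaviour with a hand-written set of pairs, so it is a stand-in rather than gensim's implementation; `KNOWN_PAIRS`, `merge_bigrams`, and the `_` delimiter are assumptions for illustration.

```python
# Hypothetical learned pairs; a real phrase model derives these from training data.
KNOWN_PAIRS = {("natural", "light"), ("high", "ceiling"), ("gas", "fireplace")}


def merge_bigrams(tokens, pairs=KNOWN_PAIRS, delimiter="_"):
    """Greedily join adjacent tokens that form a known pair, scanning left to right."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2  # skip the second half of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


print(" ".join(merge_bigrams(["ton", "natural", "light", "high", "ceiling"])))
```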

In [179]:
with open('Pickled Models/vectorizer.pkl', 'rb') as c:
    vectorizer = pickle.load(c)

In [180]:
transformed_df = vectorizer.transform(df["Description"])

In [181]:
df_bow = pd.DataFrame(transformed_df.toarray(), columns=vectorizer.get_feature_names_out())
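
Calling `transform` (rather than `fit_transform`) reuses the vocabulary learned at training time, so the new listing gets the same columns the model was trained on, and words outside that vocabulary are ignored. The pure-Python sketch below illustrates the idea for a count-based vectorizer with a made-up vocabulary; it is not scikit-learn's implementation, and `VOCABULARY` and `count_against_vocabulary` are hypothetical names.

```python
from collections import Counter

# Hypothetical training vocabulary, in column order.
VOCABULARY = ["basement", "gas_fireplace", "natural_light", "view"]


def count_against_vocabulary(text, vocabulary=VOCABULARY):
    """Return one count per vocabulary term; words outside the vocabulary are dropped."""
    counts = Counter(text.split())
    return [counts[term] for term in vocabulary]


row = count_against_vocabulary("natural_light basement pizza basement")
print(dict(zip(VOCABULARY, row)))
```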

In [182]:
df.reset_index(drop=True, inplace=True)
df_bow.reset_index(drop=True, inplace=True)

In [183]:
df_combined = pd.concat([df.drop('Description', axis=1), df_bow], axis=1)
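
`pd.concat(..., axis=1)` aligns rows by index label, which is why both frames are reset to a fresh 0-based index before this step. A toy demonstration (the frames and index values are invented):

```python
import pandas as pd

left = pd.DataFrame({"BEDS": [4.0]}, index=[7])    # index left over from earlier steps
right = pd.DataFrame({"basement": [1]}, index=[0])

misaligned = pd.concat([left, right], axis=1)
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)], axis=1
)

print(misaligned.shape)  # (2, 2): labels 7 and 0 do not line up, gaps become NaN
print(aligned.shape)     # (1, 2): a single complete row
```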

In [185]:
with open('Pickled Models/XGBPipeline.pkl', 'rb') as z:
    bestmodel = pickle.load(z)

In [186]:
prediction = bestmodel.predict(df_combined)

In [187]:
price = int(price)

In [188]:
prediction[0]

1897122.2

In [189]:
# Predicted minus listing, relative to listing: same sign convention as the dollar error printed below
percentage_error = ((prediction[0] - price) / price) * 100

In [190]:
print(f"Predicted Sale Price: ${prediction[0]:.0f}")
print(f"Listing vs. Predicted Sale Price Error: ${prediction[0] - price:.0f}")
print(f"Percentage Error: {percentage_error:.2f}%")

Predicted Sale Price: $1897122
Listing vs. Predicted Sale Price Error: $97172
Percentage Error: 5.40%
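
The dollar figure is computed as predicted minus listing. The sketch below derives the percentage from that same difference so both numbers share one sign convention; `price_errors` is a hypothetical helper, and treating "predicted minus listing" as the intended convention is an assumption.

```python
def price_errors(listing_price: float, predicted_price: float):
    """Return (dollar_error, percent_error), both defined as predicted minus listing."""
    dollar_error = predicted_price - listing_price
    percent_error = dollar_error / listing_price * 100
    return dollar_error, percent_error


dollars, percent = price_errors(1_799_950, 1_897_122.2)
print(f"Dollar error: ${dollars:.0f}")
print(f"Percentage error: {percent:.2f}%")
```

With this convention, a positive value means the model predicts a sale price above the listing price.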
