# Redfin Listing Price Predictor

## Instructions

Go to Redfin.com and find the listing whose sale price you want to predict, then:

1. Navigate to Cell in the toolbar.
2. Select Run All.
3. Enter the house information at the prompts.
4. Copy and paste the house description from the listing.
5. Enter the house listing price.
6. Scroll to the bottom to get the prediction.

- PROPERTY TYPE input options (case-sensitive):
    - Single Family Residential
    - Condo/Co-op
    - Townhouse
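
Because the property type is matched case-sensitively, an input such as `townhouse` would not match any of the options above. A minimal sketch of a check that could run before the value is stored, assuming only the three listed options are valid; the helper `normalize_property_type` is hypothetical and not part of the notebook:

```python
# Hypothetical helper: not part of the original notebook.
VALID_PROPERTY_TYPES = ("Single Family Residential", "Condo/Co-op", "Townhouse")

def normalize_property_type(raw):
    """Map user input to the exact listed spelling, ignoring case and outer whitespace."""
    lookup = {option.lower(): option for option in VALID_PROPERTY_TYPES}
    key = raw.strip().lower()
    if key not in lookup:
        raise ValueError(f"Unknown property type {raw!r}; expected one of {VALID_PROPERTY_TYPES}")
    return lookup[key]

print(normalize_property_type("  townhouse "))
```

Wrapping the `PROPERTY TYPE` prompt in a check like this would surface a typo immediately instead of passing an unseen category to the model.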

In [32]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import spacy 
import re
from gensim.models.phrases import Phraser, Phrases
import pickle
from datetime import datetime
import time

In [33]:
current_year = datetime.now().year
current_month = datetime.now().month
current_quarter = (current_month - 1) // 3 + 1  # Calculate the quarter based on the current month
current_date = datetime.now().date()

columns = [
    "Year",
    "Quarter",
    "YEAR BUILT",
    "SALE TYPE",
    "SOLD DATE",
    "PROPERTY TYPE",
    "ZIP OR POSTAL CODE",
    "BEDS",
    "BATHS",
    "LOCATION",
    "SQUARE FEET",
    "LOT SIZE", 
    "Description"
]

df = pd.DataFrame(columns=columns)
for column in columns:
    if column == "Year":
        user_input = current_year
    elif column == "Quarter":
        user_input = current_quarter
    elif column == "SOLD DATE":
        user_input = current_date 
    elif column == "SALE TYPE":
        user_input = "PAST SALE"  
    else:
        user_input = input(f"Enter {column}: ")
    df.at[0, column] = user_input
price = input("Enter Price")  # listing price, kept separate from the model features


Enter YEAR BUILT: 1982
Enter PROPERTY TYPE: Townhouse
Enter ZIP OR POSTAL CODE: 98199
Enter BEDS: 3
Enter BATHS: 3
Enter LOCATION: Magnolia
Enter SQUARE FEET: 4900
Enter LOT SIZE: 5000
Enter Description: Luxury living in Magnolia's coveted West hill with breathtaking panoramic views of Puget Sound & the Olympics from 3 levels. Natural light streams through the many windows highlighting the open floorplan & well-designed flow of the home. Whether you're hosting in one of many outdoor spaces or nestling by one of four fireplaces, this residence embodies opulent comfort & scenic splendor. Spacious kitchen, gleaming hardwoods, a 2nd living room & main-floor office. Upstairs are 3 bedrooms, including an expansive Primary Suite w/ 5-piece bath, a grand room w/ vaulted ceilings & AC for comfort. Top-of-the-world vistas and awe-inspiring sunsets from the roof deck. Downstairs a versatile rec/media room w/ wet bar. 3-car garage. Unbeatable location!
Enter Price2850000
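
The quarter formula used in the cell above maps months 1 to 3 to quarter 1, months 4 to 6 to quarter 2, and so on. A quick standalone check of that arithmetic, with `quarter_of` as an illustrative name that the notebook does not use:

```python
def quarter_of(month):
    """Return the calendar quarter (1-4) for a month number (1-12)."""
    return (month - 1) // 3 + 1

# Print the quarter for every month of the year.
for month in range(1, 13):
    print(month, quarter_of(month))
```

For a run in November, as in the output shown below, this yields quarter 4.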


In [34]:
# Age is measured against a hardcoded 2023 reference year, not current_year
df["Age"] = 2023 - df["YEAR BUILT"].astype("int")
df["Year"] = df["Year"].astype("int")
df["ZIP OR POSTAL CODE"] = df["ZIP OR POSTAL CODE"].astype("int")
df["BEDS"] = df["BEDS"].astype("float")
df["BATHS"] = df["BATHS"].astype("float")
df["SQUARE FEET"] = df["SQUARE FEET"].astype("int")
df["LOT SIZE"] = df["LOT SIZE"].astype("int")
df["Quarter"] = df["Quarter"].astype("int")
df["SOLD DATE"] = pd.to_datetime(df["SOLD DATE"])

In [35]:
df.drop("YEAR BUILT", axis=1)  # displays a copy without YEAR BUILT; df itself is unchanged

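|   | Year | Quarter | SALE TYPE | SOLD DATE | PROPERTY TYPE | ZIP OR POSTAL CODE | BEDS | BATHS | LOCATION | SQUARE FEET | LOT SIZE | Description | Age |
|---|------|---------|-----------|-----------|---------------|--------------------|------|-------|----------|-------------|----------|-------------|-----|
| 0 | 2023 | 4 | PAST SALE | 2023-11-15 | Townhouse | 98199 | 3.0 | 3.0 | Magnolia | 4900 | 5000 | Luxury living in Magnolia's coveted West hill ... | 41 |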


In [36]:
df["Description"] = df["Description"].str.lower()

In [37]:
nlp = spacy.load('en_core_web_sm')

def lemmatize(text):
    """Return text with every token replaced by its spaCy lemma, joined by spaces."""
    doc = nlp(text)
    return " ".join(token.lemma_ for token in doc)

In [38]:
df["Description"] = df["Description"].apply(lemmatize)

In [39]:
stop_words = set(stopwords.words('english'))

In [40]:
df["Description"] = df["Description"].apply(word_tokenize)

In [41]:
clean_words = []
for tokenized_description in df["Description"]:
    cleaned_tokens = [token for token in tokenized_description if token not in stop_words]
    clean_words.append(cleaned_tokens)

In [42]:
df["Description"] = clean_words

In [43]:
def clean_tokens(tokens):
    """Strip non-alphanumeric characters from each token and drop tokens left empty."""
    cleaned_tokens = []
    for token in tokens:
        cleaned_token = re.sub(r'[^a-zA-Z0-9]', '', token)
        if cleaned_token:
            cleaned_tokens.append(cleaned_token)
    return cleaned_tokens



In [44]:
df["Description"] = df["Description"].apply(clean_tokens)
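
To see what the token-cleaning step does to typical listing text, here is a small standalone sketch. It repeats the `clean_tokens` logic from the cell above in a more compact form. The sample tokens are made up for illustration and are not actual tokenizer output from the notebook.

```python
import re

def clean_tokens(tokens):
    """Strip non-alphanumeric characters from each token; drop tokens left empty."""
    cleaned = (re.sub(r'[^a-zA-Z0-9]', '', token) for token in tokens)
    return [token for token in cleaned if token]

# Illustrative tokens resembling fragments of a listing description.
sample = ["5-piece", "w/", "&", "rec/media", "bath", "!"]
print(clean_tokens(sample))  # ['5piece', 'w', 'recmedia', 'bath']
```

Punctuation-only tokens disappear, while hyphenated or slashed tokens are merged into a single word rather than split.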

In [45]:
with open('Pickled Models/bigram_model.pkl', 'rb') as f:
    bigram = pickle.load(f)

In [46]:
df['Description'] = df['Description'].apply(lambda tokens: ' '.join(bigram[tokens]))

In [47]:
with open('Pickled Models/vectorizer.pkl', 'rb') as c:
    vectorizer = pickle.load(c)

In [48]:
transformed_df = vectorizer.transform(df["Description"])

In [49]:
df_bow = pd.DataFrame(transformed_df.toarray(), columns=vectorizer.get_feature_names_out())

In [50]:
df.reset_index(drop=True, inplace=True)
df_bow.reset_index(drop=True, inplace=True)

In [51]:
df_combined = pd.concat([df.drop('Description', axis=1), df_bow], axis=1)

In [53]:
with open('Pickled Models/XGBPipeline.pkl', 'rb') as z:
    bestmodel = pickle.load(z)

In [54]:
prediction = bestmodel.predict(df_combined)

In [55]:
price = int(price)

In [56]:
prediction[0]

2094230.0

In [57]:
# Positive when the listing price exceeds the predicted sale price
percentage_error = ((price - prediction[0]) / price) * 100

In [58]:
print(f"Predicted Sale Price: ${prediction[0]:.0f}")
print(f"Listing vs. Predicted Sale Price Error: ${prediction[0] - price:.0f}")
print(f"Percentage Error: {percentage_error:.2f}%")

Predicted Sale Price: $2094230
Listing vs. Predicted Sale Price Error: $-755770
Percentage Error: 26.52%
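
The dollar error above is computed as prediction minus listing price (negative here), while the percentage is computed as listing price minus prediction (positive here). A sketch that reports both with the same sign, assuming the listing price is the reference value; `listing_vs_prediction` is a hypothetical helper, not part of the notebook:

```python
def listing_vs_prediction(listing_price, predicted_price):
    """Return (dollar difference, percent difference), both as prediction minus listing."""
    difference = predicted_price - listing_price
    percent = difference / listing_price * 100
    return difference, percent

# Values taken from the run shown above.
difference, percent = listing_vs_prediction(2850000, 2094230.0)
print(f"Difference: ${difference:,.0f}")
print(f"Percent difference: {percent:.2f}%")
```

With the values from this run, both figures come out negative: the prediction is about 26.52% below the listing price.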
