# Final Project: House Price Prediction
## Corpus Christi Team
### Step 1 (Data Collection)

The data is obtained from _Zillow.com_

The data is requested through API calls using _RapidAPI_

**MAKE SURE TO INPUT YOUR RapidAPI KEY**

#### Import required libraries

In [1]:
import pandas as pd
import requests
import json
import re
import os
import warnings

warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

# Import your RapidAPI key
from config import krapid as key

#### Define the search filters

**NOTE:** For the min/max price please use these values:

* From 100,000 to 250,000 (Hector)
* From 250,001 to 300,000 (Nino)
* From 300,001 to 375,000 (Oscar)
* From 375,001 to 500,000 (Neeraja)
* From 500,001 to 900,000 (Dianabasi)

**ADDITIONAL NOTES:**

* Each person should be able to make 20 API calls for free.
* Each API call to an endpoint returns data from 40 houses.
* In total each person should retrieve data from about 800 houses (20 pages)
* We will all gather information from ~4000 houses in Houston, TX.
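
The quota arithmetic in the notes above can be checked with a short sketch. The dictionary `price_ranges` and the variable names are illustrative and not part of the notebook; the numbers are taken from the two lists above.

```python
# Illustrative check of the team's API budget (names here are hypothetical).
price_ranges = {
    "Hector": (100_000, 250_000),
    "Nino": (250_001, 300_000),
    "Oscar": (300_001, 375_000),
    "Neeraja": (375_001, 500_000),
    "Dianabasi": (500_001, 900_000),
}

calls_per_person = 20  # free API calls available to each person
houses_per_call = 40   # listings returned per call (one page)

houses_per_person = calls_per_person * houses_per_call
houses_total = houses_per_person * len(price_ranges)

print(houses_per_person)  # 800
print(houses_total)       # 4000

# Each range should start one dollar above the end of the previous range.
bounds = list(price_ranges.values())
contiguous = all(nxt[0] == prev[1] + 1 for prev, nxt in zip(bounds, bounds[1:]))
print(contiguous)  # True
```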

In [11]:
# API specific (do not modify)
host = "zillow-com1.p.rapidapi.com"
url = "https://" + host + "/propertyExtendedSearch"

# Search filters
city = "Houston"
state = "TX"
location = city + ", " + state
homeType = "Houses"

# REMEMBER TO MODIFY THESE 3 VARIABLES
page = 2 # Iterate from 1 to 20
minPrice = 300001 # Use your assigned min price
maxPrice = 375000 # Use your assigned max price

sort = "Price_Low_High"
minSqft = "1,000 sqft"
maxSqft = "1/4 acre/10,890 sqft"
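
Because `page` has to be stepped from 1 to 20 by hand, one option is to build the query parameters for every page up front. The sketch below assumes the same parameter names used in the query cell of this notebook; `build_query` is a hypothetical helper, and no request is sent.

```python
# Hypothetical helper: builds the query parameters for a single page.
def build_query(page, min_price, max_price, location="Houston, TX"):
    return {
        "location": location,
        "page": page,
        "home_type": "Houses",
        "sort": "Price_Low_High",
        "minPrice": min_price,
        "maxPrice": max_price,
        "lotSizeMin": "1,000 sqft",
        "lotSizeMax": "1/4 acre/10,890 sqft",
    }

# One parameter set per page, pages 1 through 20.
queries = [build_query(p, 300001, 375000) for p in range(1, 21)]
print(len(queries))                             # 20
print(queries[0]["page"], queries[-1]["page"])  # 1 20
```

Each entry in `queries` could then be passed as `params` to the request, one call per page, which stays within the 20 free calls per person.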

#### Output Parameters

In [12]:
# Create output directory if it does not exist
os.makedirs('./data/', exist_ok=True)

# Output file name
out_name = f"data_{city}_{state}_{homeType}_p{page}_price_{minPrice}_{maxPrice}"

#### Perform the query

In [13]:
querystring = {
    "location": location,
    "page": page,
    "home_type": homeType,
    "sort": sort,
    "minPrice": minPrice,
    "maxPrice": maxPrice,
    "lotSizeMin": minSqft,
    "lotSizeMax": maxSqft,
}

headers = {
    "X-RapidAPI-Key": key,
    "X-RapidAPI-Host": host,
}

response = requests.get(url, headers=headers, params=querystring)
json_response = response.json()
print(response) # If 200, it was successful

<Response [200]>
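
Printing the response only shows the status code. A slightly stricter check, sketched below, confirms that the parsed payload actually contains listings before moving on. The `props` key is the one read by the later cells; `get_props` is a hypothetical helper, and the sample payload is made up for illustration.

```python
def get_props(payload):
    """Return the list of listings from a parsed response, or raise if absent."""
    if not isinstance(payload, dict) or "props" not in payload:
        raise ValueError("Response has no 'props' key; check the API key and quota.")
    return payload["props"]

# Made-up payload with the same top-level keys the notebook reads.
sample = {"currentPage": 2, "props": [{"zpid": 1}, {"zpid": 2}]}
print(len(get_props(sample)))  # 2
```

With a live response, `response.raise_for_status()` can be called first so that HTTP errors surface before the JSON is parsed.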


#### Initial basic QC (optional)

In [None]:
# Run this cell to view the response
#json_response # Check the retrieved data
#json_response.keys() # Check the keys of the json response
#json_response['props'][0] # Display the first house data

#### Save the raw response

In [14]:
# Write the JSON response to a text file (optional, as a backup)
with open(f'./data/{out_name}.txt', "w") as f:
    f.write(response.text)

In [8]:
# Import the text file (only if you need to restart from this point)
#f = open(f'./data/{out_name}.txt', 'r')
#content = f.read()
#jsonImported = json.loads(content)
#jsonImported
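
The backup and reload steps above amount to a JSON round trip. The sketch below shows the same pattern with context managers and a temporary directory, so it does not touch `./data/`; the payload and the file name are placeholders.

```python
import json
import os
import tempfile

payload = {"currentPage": 2, "props": []}  # placeholder for response.json()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "backup.txt")

    # Write the JSON text; the file is closed automatically on exit.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f)

    # Read it back into a Python dictionary.
    with open(path, "r", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == payload)  # True
```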

#### Create a clean real estate data DF

In [15]:
cnames = ['Page', 'Item', 'zid', 'State', 'City', 'Number', 'Street', 'zipCode', 'Lat', 'Lng', 'Price', 'Image', 'Bedrooms', 'Bathrooms', 'Status', 'daysOnSale']
rows = []

for i, prop in enumerate(json_response['props']):
    address = prop['address']

    # Regular expressions to break down the address into house number, street and ZIP code.
    # Raw strings avoid invalid-escape warnings; re.escape guards the city and state text.
    address_num = ''.join(re.findall(r"^\d+", address))
    address_st = ''.join(re.findall(r"^\d+\s(.+),\s" + re.escape(city), address))
    address_zip = ''.join(re.findall(re.escape(state) + r".(\d{5})", address))

    rows.append({
        'Page': json_response['currentPage'],
        'Item': i + 1,
        'zid': prop['zpid'],
        'State': state,
        'City': city,
        'Number': address_num,
        'Street': address_st,
        'zipCode': address_zip,
        'Lat': prop['latitude'],
        'Lng': prop['longitude'],
        'Price': prop['price'],
        'Image': prop['imgSrc'],
        'Bedrooms': prop['bedrooms'],
        'Bathrooms': prop['bathrooms'],
        'Status': prop['listingStatus'],
        'daysOnSale': prop['daysOnZillow'],
    })

# DataFrame.append was removed in pandas 2.0, so build the frame from the list of rows
df = pd.DataFrame(rows, columns=cnames)

df.head(5)

,Page,Item,zid,State,City,Number,Street,zipCode,Lat,Lng,Price,Image,Bedrooms,Bathrooms,Status,daysOnSale
0,2,1,27853008,TX,Houston,600,E 40th 1/2 St,77022,29.822659,-95.39235,309000,https://photos.zillowstatic.com/fp/819785428f5...,3,2,FOR_SALE,-1
1,2,2,28078321,TX,Houston,6215,Great Oaks Dr,77050,29.903404,-95.29729,309000,https://photos.zillowstatic.com/fp/61a6e89f54b...,4,3,FOR_SALE,-1
2,2,3,28150303,TX,Houston,5930,Lodge Creek Dr,77066,29.974018,-95.51575,309000,https://photos.zillowstatic.com/fp/522c7c7af33...,5,5,FOR_SALE,-1
3,2,4,28377473,TX,Houston,14822,James River Ln,77084,29.865757,-95.630775,309000,https://photos.zillowstatic.com/fp/c3f1f3e01c7...,4,3,FOR_SALE,-1
4,2,5,28444756,TX,Houston,12006,Hadley Falls Ct,77067,29.959593,-95.45223,309000,https://photos.zillowstatic.com/fp/a7fd0fca1de...,3,2,FOR_SALE,-1
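
The address parsing in the loop can also be folded into one function, which makes the regular expressions easy to test in isolation. This sketch assumes addresses of the form `600 E 40th 1/2 St, Houston, TX 77022`, which is inferred from the patterns and the sample output above rather than confirmed against the API; `split_address` is a hypothetical name.

```python
import re

def split_address(address, city="Houston", state="TX"):
    """Split 'NUMBER STREET, CITY, STATE ZIP' into (number, street, zip)."""
    number = re.search(r"^\d+", address)
    street = re.search(r"^\d+\s(.+),\s" + re.escape(city), address)
    zip_code = re.search(re.escape(state) + r"\s(\d{5})", address)
    # Fall back to an empty string when a part is missing.
    return (
        number.group(0) if number else "",
        street.group(1) if street else "",
        zip_code.group(1) if zip_code else "",
    )

print(split_address("600 E 40th 1/2 St, Houston, TX 77022"))
# ('600', 'E 40th 1/2 St', '77022')
```

Returning empty strings for missing parts mirrors the behavior of `''.join(re.findall(...))` in the loop, so a malformed address yields blank fields instead of an exception.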


#### Save the clean DF

In [16]:
# Save the DataFrame to a CSV file
df.to_csv(f'./data/{out_name}.csv', index=False)
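
Since each run writes one CSV per page, the files eventually need to be stacked into a single table. The following is a minimal sketch of one way to do that, not a step from this notebook: it uses a temporary directory with two tiny made-up files in place of `./data/`, and it assumes `zid` identifies a listing uniquely.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two tiny stand-in files; real files would come from ./data/.
    pd.DataFrame({"zid": [1, 2], "Price": [309000, 310000]}).to_csv(
        os.path.join(tmp_dir, "data_p1.csv"), index=False)
    pd.DataFrame({"zid": [2, 3], "Price": [310000, 312000]}).to_csv(
        os.path.join(tmp_dir, "data_p2.csv"), index=False)

    # Read every CSV in the folder and stack them into one DataFrame.
    files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
    combined = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Drop listings that appear on more than one page.
combined = combined.drop_duplicates(subset="zid").reset_index(drop=True)
print(len(combined))  # 3
```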