# **Zillow Housing Data Analysis**

**Scraping**
1. Use Beautiful Soup to begin collecting data
2. Collect 10 observations per state to get a general idea of the price distribution
3. Collect the necessary information from each property, handling missing values
4. Gather price, address, square feet, number of rooms, and number of bathrooms
5. Store the data for each state in separate dataframes, merging them once web scraping is complete

--------------------------------------------------------------------------------

**Problems faced**
Since Zillow's website discourages scraping, we adapted by adding delays between requests and varying user agents. This notebook shows the methods used to scrape the website. To continue building our data, we needed to scrape from multiple computers using the aforementioned strategy. Each request also returned only the first 10 homes. To account for these issues, we gathered 10 homes from each state and used this data to estimate housing information.

**Imports required**

In [None]:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import time
import random

**User-Agent**

In [None]:
headers = {'User-Agent': 'Launch webscraping project (user-name@gmail.com)'}

**Access_link**:
- Input: City, State abbreviation, user-agent
- Output: The response from the webpage; if the request is valid, its status code should be 200

In [None]:
def access_link(city, state, header_in):
  # Zillow URLs use hyphens in place of spaces in city names
  city = city.replace(" ","-")
  url = f"https://www.zillow.com/homes/{city},-{state}_rb/"
  # wait a random 5-16 seconds before each request
  time.sleep(random.randint(5,16))
  response = requests.get(url,headers=header_in)
  return response

**Data collection process**

In [None]:
# Example using NY, NY
response = access_link("New York", "NY", headers)

In [None]:
data = BeautifulSoup(response.text,'html')
# get each property card
search_results = data.find("div", id="grid-search-results")
properties = search_results.find_all("li")
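
Before parsing, it may help to confirm that the request actually returned a search page, since `find` returns `None` when the results container is missing. The sketch below is one possible check and is not part of the original notebook: `looks_like_results_page` is a hypothetical helper, and the idea that a blocked request lacks the `grid-search-results` id is an assumption that has not been verified against Zillow. `SimpleNamespace` objects stand in for real responses so the example runs without a network call.

```python
from types import SimpleNamespace


def looks_like_results_page(response, marker="grid-search-results"):
    """Return True if the status is 200 and the results container id appears in the HTML."""
    return response.status_code == 200 and marker in response.text


# Stand-ins exposing the same two attributes used from a requests response
good = SimpleNamespace(status_code=200, text='<div id="grid-search-results"></div>')
blocked = SimpleNamespace(status_code=403, text="Access denied")

print(looks_like_results_page(good))     # True
print(looks_like_results_page(blocked))  # False
```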

In [None]:
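# print the count explicitly; a bare expression is only displayed when it is the last line of a cell
print(len(properties))
for p in properties[:3]:
  print(p)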

In [None]:
with open('html.txt','w') as f:
  f.write(response.text)

In [None]:
response = access_link("New York", "NY", headers)

**Get Prices**:
- Input: valid response associated with your initial query
- Output: A list of prices associated with homes given the location specified

In [None]:
def get_prices(response):
  data = BeautifulSoup(response.text,'html')
  time.sleep(random.randint(5,16))
  # find the price element on each property card
  price = data.find_all('div', class_ = "srp__sc-16e8gqd-0 gKmVGs")
  prices=[]
  for i in price:
    # remove the dollar sign and the thousands separators
    x = i.text.strip().replace("$",'')
    x = x.replace(',','')
    prices.append(x)
  return prices

In [None]:
get_prices(response)
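
The values returned by `get_prices` are still strings, so they need converting before any numeric analysis. A minimal sketch of that step follows, assuming prices are written out in full (for example `$770,000` rather than an abbreviated form). `parse_price` is a hypothetical helper, not part of the original notebook.

```python
import re


def parse_price(text):
    """Convert a price string such as "$770,000" to an int, or None if it has no digits."""
    digits = re.sub(r"[^\d]", "", text)
    return int(digits) if digits else None


print(parse_price("$770,000"))  # 770000
print(parse_price("--"))        # None
```

Returning `None` for entries without digits leaves the choice of how to treat missing prices to the analysis step.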

**get_addresses**
- Input: A valid response from your desired location
- Output: A list of addresses associated with the homes in the location specified by the user

In [None]:
def get_addresses(response):
  data = BeautifulSoup(response.text,'html')
  time.sleep(random.randint(5,16))
  # each property card wraps its details in this container
  info = data.find_all('div', class_ = "StyledPropertyCardDataWrapper-c11n-8-85-1__sc-1omp4c3-0 jVBMsP property-card-data")
  address=[]
  for i in info:
    time.sleep(random.randint(5,16))
    # the <address> tag inside the card holds the full address text
    address.append(i.find('address').text.strip())
  return address

In [None]:
get_addresses(response)
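
Each address comes back as one string, and later steps only need parts of it, such as the city. The sketch below shows one way to split an address, assuming every address follows the `street, city, ST zip` pattern seen in this notebook's Hilo results. `split_address` is a hypothetical helper, and it raises a `ValueError` for strings that do not match that pattern.

```python
def split_address(address):
    """Split "street, city, ST zip" into its parts, working from the right."""
    # rsplit from the right so a comma inside the street part would not break the split
    street, city, state_zip = [part.strip() for part in address.rsplit(",", 2)]
    state, zip_code = state_zip.split()
    return {"street": street, "city": city, "state": state, "zip": zip_code}


print(split_address("3914 Kila Pl LOT 6-C, Hilo, HI 96720"))
# {'street': '3914 Kila Pl LOT 6-C', 'city': 'Hilo', 'state': 'HI', 'zip': '96720'}
```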

**get_housing_info**
- Input: A valid response from your desired location
- Output: A dictionary whose keys are bed, bath, and sqft. Each key holds a list of the values pulled from the webpage for the homes at the specified location.

In [None]:
def get_housing_info(response):
  data = BeautifulSoup(response.text,'html')
  time.sleep(random.randint(5,16))
  # bed, bath, and sqft values are the <b> tags on the page, in that order
  lis = [i.text.strip() for i in data.find_all('b')]
  # every third value goes to the same key; this assumes three <b> tags per listing
  dic={}
  dic['bed'] = lis[0::3]
  dic['bath'] = lis[1::3]
  dic['sqft'] = lis[2::3]
  return dic


In [None]:
# Example run with Dover DE
response = access_link ("Dover","DE", headers)
get_housing_info(response)

{'bed': ['4', '5', '4', '3', '3', '3', '2', '3', '0.86 acres lot'],
 'bath': ['2', '4', '2', '2', '2', '2', '2', '2'],
 'sqft': ['2,194',
  '3,750',
  '2,000',
  '1,820',
  '1,555',
  '1,488',
  '1,680',
  '1,512']}
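
In the Dover output above, the `bed` list ends with a lot-size entry and is one item longer than the other lists, because a land listing contributes a single value instead of three. The following is a minimal sketch of one way to regroup the flat list of `<b>` values, not the method used in this notebook. It assumes a home always contributes exactly three values and a land listing exactly one value ending in "lot". `group_listing_values` is a hypothetical helper, and `--` is the same placeholder this notebook later uses for Delaware.

```python
def group_listing_values(values, placeholder="--"):
    """Regroup a flat list of <b> values into bed, bath, and sqft columns."""
    grouped = {"bed": [], "bath": [], "sqft": []}
    i = 0
    while i < len(values):
        if values[i].lower().endswith("lot"):
            # land listing: one value on the page, no bed, bath, or sqft
            row = [placeholder] * 3
            i += 1
        else:
            # home listing: take three values, padding if the list ends early
            row = (values[i:i + 3] + [placeholder] * 3)[:3]
            i += 3
        for key, value in zip(grouped, row):
            grouped[key].append(value)
    return grouped


flat = ["4", "2", "2,194", "0.86 acres lot", "3", "2", "1,512"]
print(group_listing_values(flat))
# {'bed': ['4', '--', '3'], 'bath': ['2', '--', '2'], 'sqft': ['2,194', '--', '1,512']}
```

All three lists come out the same length, with one row per listing, so they line up with the prices and addresses.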

**Scrape_housing**: This function will gather the price, address, number of bedrooms, sqft, and number of bathrooms from each location that the user specifies.
- Input: the response returned by the access_link function and the state abbreviation
- Output: Returns a dataframe containing all the information associated with homes in the city and state specified by the user. Note, due to the response from the website, each request gets data on the first 10 homes.

**Notes**: To reduce issues with responses from Zillow, make sure you do not attempt to pull too much information at once. If you intend to pull more information, some adjustments will need to be made to the script. While time.sleep() is used in the access_link function to reduce issues with Zillow, it is still important to scrape responsibly.

In [None]:
def scrape_housing(response,state):
  prices = get_prices(response)
  addresses = get_addresses(response)
  dic_info = get_housing_info(response)
  dic_info['prices'] = prices
  dic_info['addresses'] = addresses
  dic_info['cities'] = []
  dic_info['state'] = []
  for i in addresses:
    dic_info['cities'].append(i.split(',')[1])
    dic_info['state'].append(state)
  df = pd.DataFrame.from_dict(dic_info)
  return df
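
`pd.DataFrame.from_dict` raises a `ValueError` when the lists in the dictionary have different lengths, which can happen when a page mixes homes and land listings. A small check like the one sketched below could be run on the dictionary before building the dataframe. Both helper names are hypothetical and not part of the original notebook.

```python
def column_lengths(columns):
    """Return {column name: number of values} for a dict of lists."""
    return {name: len(values) for name, values in columns.items()}


def has_equal_lengths(columns):
    """True when every list in the dict has the same length."""
    return len(set(column_lengths(columns).values())) <= 1


sample = {"bed": ["4", "0.86 acres lot"], "bath": ["2"], "sqft": ["2,194"]}
print(column_lengths(sample))     # {'bed': 2, 'bath': 1, 'sqft': 1}
print(has_equal_lengths(sample))  # False
```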

In [None]:
# Example usage of function
df_nyc = scrape_housing(response,"NY")

In [None]:
# saving result to a csv file
df_nyc.to_csv("nyc.csv")

**Pulling state abbreviations**: To keep the choice of city consistent across states, we use the capital of each state. We also pull this information and store it in one place, which makes it easier to pass state abbreviations and cities into our functions.

**Storing information in a list of tuples**: We store this information in a list of tuples, where the first value is the state abbreviation and the second value is the capital.

In [None]:
# states and capitals scrape :)
url = "https://bigdave44.com/features/the-mine/us-states-abbreviations-capitals-nicknames/"
response = requests.get(url,headers=headers)

In [None]:
data = BeautifulSoup(response.text,'html')

In [None]:
# collect the text of every table row
lis = [i.text.strip() for i in data.find_all('tr')]
# skip the header row and keep the 50 state rows;
# within a row, field 1 is the abbreviation and field 2 is the capital
states = [i.split('\n')[1] for i in lis[1:51]]
cities = [i.split('\n')[2] for i in lis[1:51]]
dic_cities = {}
dic_cities['state']=states
dic_cities['cities']=cities

In [None]:
# Storing capitals and abbreviations in a dataframe
cities_df = pd.DataFrame.from_dict(dic_cities)
city_info = list(zip(states,cities))
city_info

[('AL', 'Montgomery'),
 ('AK', 'Juneau'),
 ('AZ', 'Phoenix'),
 ('AR', 'Little Rock'),
 ('CA', 'Sacramento'),
 ('CO', 'Denver'),
 ('CT', 'Hartford'),
 ('DE', 'Dover'),
 ('FL', 'Tallahassee'),
 ('GA', 'Atlanta'),
 ('HI', 'Honolulu'),
 ('ID', 'Boise'),
 ('IL', 'Springfield'),
 ('IN', 'Indianapolis'),
 ('IA', 'Des Moines'),
 ('KS', 'Topeka'),
 ('KY', 'Frankfort'),
 ('LA', 'Baton Rouge'),
 ('ME', 'Augusta'),
 ('MD', 'Annapolis'),
 ('MA', 'Boston'),
 ('MI', 'Lansing'),
 ('MN', 'St. Paul'),
 ('MS', 'Jackson'),
 ('MO', 'Jefferson City'),
 ('MT', 'Helena'),
 ('NE', 'Lincoln'),
 ('NV', 'Carson City'),
 ('NH', 'Concord'),
 ('NJ', 'Trenton'),
 ('NM', 'Santa Fe'),
 ('NY', 'Albany'),
 ('NC', 'Raleigh'),
 ('ND', 'Bismarck'),
 ('OH', 'Columbus'),
 ('OK', 'Oklahoma City'),
 ('OR', 'Salem'),
 ('PA', 'Harrisburg'),
 ('RI', 'Providence'),
 ('SC', 'Columbia'),
 ('SD', 'Pierre'),
 ('TN', 'Nashville'),
 ('TX', 'Austin'),
 ('UT', 'Salt Lake City'),
 ('VT', 'Montpelier'),
 ('VA', 'Richmond'),
 ('WA', 'Olympia'

In [None]:
# Storing information from Hawaii
# access_link expects the city first, then the state abbreviation
response_hi = access_link("Hilo", "HI", headers)
dic_hi = scrape_housing(response_hi,"HI")
df = pd.DataFrame.from_dict(dic_hi)


In [None]:
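# Manually pad the Delaware lists with '--' for the lot listing, which has no bed, bath, or sqft values
dic_de['bed'] = ['4', '5', '4', '3', '3', '3', '2', '3', '--']
dic_de['bath'] = ['2', '4', '2', '2', '2', '2', '2', '2','--']
dic_de['sqft'] = ['2,194', '3,750', '2,000', '1,820', '1,555', '1,488', '1,680', '1,512', '--']
# Display the Hawaii results
dic_hi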


{'bed': ['5', '3', '0.53 acres lot', '2', '3', '3', '6'],
 'bath': ['3', '3', '0.54 acres lot', '1,212', '2,664', '2', '3'],
 'sqft': ['2,202', '1,862', '3', '4', '0.43 acres lot', '1,355', '1,600'],
 'prices': ['770000',
  '719000',
  '249000',
  '330000',
  '1199000',
  '1999995',
  '649000',
  '850000',
  '680000'],
 'addresses': ['50 Malia St, Hilo, HI 96720',
  '1172 Kumukoa St, Hilo, HI 96720',
  '1139 Kaumana Dr, Hilo, HI 96720',
  '3914 Kila Pl LOT 6-C, Hilo, HI 96720',
  '2048 Kalanianaole St, Hilo, HI 96720',
  '229 Maikai St, Hilo, HI 96720',
  '106 Nene St, Hilo, HI 96720',
  '30 Koula St, Hilo, HI 96720',
  '1280 Kumukoa St, Hilo, HI 96720'],
 'cities': [' Hilo',
  ' Hilo',
  ' Hilo',
  ' Hilo',
  ' Hilo',
  ' Hilo',
  ' Hilo',
  ' Hilo',
  ' Hilo'],
 'state': ['HI', 'HI', 'HI', 'HI', 'HI', 'HI', 'HI', 'HI', 'HI']}

**Combining states**: The initial states gathered have been stored in separate csv files. To begin our analysis, we must combine the dataframes. The remaining states and their corresponding data were collected by my partners.

In [None]:
df_ct = pd.DataFrame.from_dict(dic_ct)
df_al = pd.read_csv("alabama.csv")
df_alas = pd.read_csv("alaska.csv")
df_ar = pd.read_csv("arizona.csv")
df_cali = pd.read_csv("cali.csv")
df_col = pd.read_csv("coloroo.csv")
df_geo = pd.read_csv("georgia.csv")
df_indi = pd.read_csv("indiana.csv")
df_ka = pd.read_csv("kansas.csv")
df_nyc = pd.read_csv("nyc.csv")

lis = [df_al,df_alas,df_ar,df_cali,df_col,df_geo,df_nyc, df_ct]
df = pd.concat(lis)
df_az = pd.DataFrame.from_dict(dic_ark)
df_az.to_csv("arizona.csv")

In [None]:
housing = pd.read_csv('housing.csv')
df.drop(df.columns[[0, 1]], axis=1, inplace=True)
df.to_csv('housing.csv')

In [None]:
pd.read_csv('housing.csv')
lis = [housing,df]
df = pd.concat(lis)
df.drop(df.columns[[0,1]],axis =1, inplace = True)

In [None]:
df.to_csv("housing.csv")
pd.DataFrame.from_dict(df_ct)
colorodo_df.to_csv("coloroo.csv")
df.to_csv('housing.csv')
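
The cells above repeatedly drop the first columns after reading a file back, which is typical when `to_csv` has written the row index into the file. The sketch below shows an alternative under stated assumptions: all per-state files sit in one folder and share the same columns. `combine_state_csvs` is a hypothetical helper, and the rows in the demonstration are placeholders, not real listings.

```python
import glob
import os
import tempfile

import pandas as pd


def combine_state_csvs(folder):
    """Read every CSV file in `folder` and stack the rows into one DataFrame."""
    paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True renumbers the rows instead of repeating each file's index
    return pd.concat(frames, ignore_index=True)


# Demonstration with placeholder rows written to a temporary folder
with tempfile.TemporaryDirectory() as folder:
    for name, state in [("first.csv", "AA"), ("second.csv", "BB")]:
        row = pd.DataFrame({"prices": [100000], "state": [state]})
        # index=False keeps the row index out of the file, so no extra
        # unnamed column appears when the file is read back
        row.to_csv(os.path.join(folder, name), index=False)
    combined = combine_state_csvs(folder)

print(combined)
```

Writing with `index=False` and concatenating with `ignore_index=True` removes the need to drop columns by position afterwards.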

**Access_link (alternate)**: Because of issues getting responses from Zillow, an alternate access_link() function was written. It varies the User-Agent in each request and increases the time the program waits between requests. It was used to gather the final few states, where we ran into more issues while gathering data.

In [None]:
# Uses access_link(city, state, header_in) and scrape_housing(response, state) defined above
user_agents = ['goodbye-world@gmail.com', 'hello-world@yahoo.com', 'dinosaur@gmail.com',
               'long-neck@yahoo.com', 'bread-winter@gmail.com']
lis =[]
# city_info holds (state abbreviation, capital) tuples
for count, (state, city) in enumerate(city_info[2:17], start=1):
  if count % 5 == 0:
    # extra pause after every fifth request
    time.sleep(random.randint(10,25))
  # rotate through the User-Agent strings so consecutive requests differ
  headers = {'User-Agent': user_agents[count % len(user_agents)]}
  response = access_link(city, state, headers)
  dic = scrape_housing(response, state)
  lis.append(dic)
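
The alternate access_link() function itself is not shown in this notebook. The following is a minimal sketch of what such a function might look like, not the authors' implementation. It reuses the URL pattern from access_link, the User-Agent strings from the loop above, and the 10 to 25 second pause from that loop. The name `access_link_alt` and the `fetch` and `sleep` parameters are hypothetical; they are there so the function can be tried without sending a real request.

```python
import random
import time

# User-Agent strings taken from the loop above
USER_AGENTS = [
    "hello-world@yahoo.com",
    "dinosaur@gmail.com",
    "long-neck@yahoo.com",
    "bread-winter@gmail.com",
    "goodbye-world@gmail.com",
]


def access_link_alt(city, state, fetch, agents=USER_AGENTS, sleep=time.sleep):
    """Request a search page with a random User-Agent after a longer random delay.

    `fetch` is any callable with the shape of requests.get(url, headers=...).
    """
    city = city.replace(" ", "-")
    url = f"https://www.zillow.com/homes/{city},-{state}_rb/"
    headers = {"User-Agent": random.choice(agents)}
    # wait 10-25 seconds, the longer pause used in the loop above
    sleep(random.randint(10, 25))
    return fetch(url, headers=headers)
```

In this notebook it would be called as `access_link_alt("Dover", "DE", fetch=requests.get)`.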