### Import Libraries 

 - The requests library lets you send HTTP requests from Python to a specific URL. In our case, we send HTTP requests to Zillow
 - The time module handles time-related tasks, including formatting dates, waiting, and representing time
 - The random module generates random numbers and random samples
 - The bs4 module (BeautifulSoup) pulls data out of an HTML document once you have a response to an HTTP request
 - The os module lets you interact with the operating system, including changing the working directory
 - The selenium module automates interaction with a web browser, including requesting a URL and extracting the HTML
   document that comes back

In [1]:
import requests
import time
from bs4 import BeautifulSoup
from random import sample 
import pandas as pd 
import os
from selenium import webdriver
import json
import csv
from datetime import datetime
import re


### Set Path
 - Identify your destination folder
 - Use os.chdir to make your destination folder the working directory. All outputs are exported there

In [2]:
path = "../webscraping_outputs-Z"
os.chdir(path)

### Create a file name
 - Create an output file name. I called mine ZillowSelenium and appended a date-time stamp
 - Note: If you scrape several times a day, include the hour and minute in the time stamp so that you don't overwrite data you have already exported

In [3]:
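# %H is the hour (24-hour clock) and %M the minute; lowercase %h and %m would give the month name and number
finalfile = "ZillowSelenium" + "_" + "{:%Y_%m_%d_%H_%M}".format(datetime.now()) + ".csv"
finalfile
# e.g. 'ZillowSelenium_2023_11_26_<HH>_<MM>.csv'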
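
To make the note about hours and minutes concrete, here is a minimal sketch of a file-name helper. The function name `build_filename` and the fixed example datetime are assumptions made for illustration; they are not part of the original notebook.

```python
from datetime import datetime

def build_filename(prefix, when):
    """Return '<prefix>_<YYYY_MM_DD_HH_MM>.csv' for the given datetime."""
    # %H is the 24-hour clock hour and %M the minute;
    # lowercase %m is the month number, so it cannot stand in for minutes
    return f"{prefix}_{when:%Y_%m_%d_%H_%M}.csv"

# Fixed datetime chosen only so the result is reproducible
example = build_filename("ZillowSelenium", datetime(2023, 11, 26, 14, 5))
print(example)  # ZillowSelenium_2023_11_26_14_05.csv
```

In the notebook you would pass `datetime.now()` instead of a fixed value, so two runs in the same day get different names as long as they start in different minutes.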

### Main Webscraping 

- Output results
- Page numbers
- URL 
- Selenium Setup


In [11]:
#Create a list that will hold the results

results = []


# Inspect the Zillow website and work out how many result pages there are for rental ads.
# In the Charlotte example there are 20 pages in total, so the range stops at 21.

for page in range(1,21,1):
    
    print("This is page: " + str(page))
    
    # Build the Zillow URL for your city. It follows this format:
    # 1. The base Zillow URL: https://www.zillow.com/
    # 2. The name of your city, e.g. charlotte-nc, atlanta-ga
    # 3. The page number
    # 4. The "_p" suffix that Zillow appends to the page number
    # 5. For example, page 15 for Charlotte is: https://www.zillow.com/charlotte-nc/rentals/15_p/
    
    url = "https://www.zillow.com/charlotte-nc/rentals/" + str(page) + '_p/'
    # Testing the Philadelphia area: keep the Charlotte URL for reference, then override it
    old_url = url
    url = "https://www.zillow.com/philadelphia-pa/rentals/" + str(page) + '_p/'
    
    # Here we use Selenium. To automate a web browser you need a web driver, and each browser has its own.
    # I am using Google Chrome, so I downloaded the driver from
    # "https://chromedriver.storage.googleapis.com/index.html?path=98.0.4758.80/"
    
    # After downloading and extracting the driver (chromedriver.exe), call webdriver.Chrome() to launch
    # the Chrome browser. No path is passed here, so the driver has to be discoverable, for example on PATH.
    
    
    CraiglistBrowser = webdriver.Chrome()
    CraiglistBrowser.maximize_window()
    
    # Once the browser has launched, load the URL with get() and read back the rendered HTML
    Craiglist = CraiglistBrowser.get(url)
    CraiglistHTML = CraiglistBrowser.execute_script("return document.documentElement.outerHTML")
    soup = BeautifulSoup(CraiglistHTML, 'html.parser')
    CraiglistBrowser.quit()
    print(url)

    # The old classes (photo-cards photo-cards_wow photo-cards_short) are no longer present.
    # Note that find() returns None when no <ul> carries the class below.
    deck = soup.find('ul',{'class': 'StyledPropertyCardHomeDetailsList-c11n-8-84-3__sc-1xvdaej-0 eYPFID'})

    for card in deck.contents: 
        script = card.find('script',  {'type': 'application/ld+json'})
        print(script)


        try: 
            if script:
            
                script_json = json.loads(script.contents[0])
               
                try:
                    descriptions = script_json['url']
                    CraiglistBrowser = webdriver.Chrome()
                    CraiglistBrowser.maximize_window()
                    Craiglist = CraiglistBrowser.get(descriptions)
                    CraiglistHTML = CraiglistBrowser.execute_script("return document.documentElement.outerHTML")
                    soup = BeautifulSoup(CraiglistHTML, 'html.parser')
                    CraiglistBrowser.quit()

                except Exception:
                    # ad-hoc print to show we didn't get the description
                    print("passed the script block")

                loop_soup = BeautifulSoup(CraiglistHTML, 'html.parser')

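                # Default to None so the append below never refers to an undefined name
                loopresults2 = None
                try:
                    loopresults2 = loop_soup.find('div', {'class' : 'ds-overview-section'}).text
                except AttributeError:
                    # find() returned None: the overview section was not on the page
                    print("passed the overview retrieval")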


                results.append({
                                    'latitude': script_json['geo']['latitude'],
                                    'longitude': script_json['geo']['longitude'],
                                    'floorsize': script_json['floorSize']['value'],
                                    'streetaddress': script_json['name'],
                                    'zipcode': script_json['address']['postalCode'],
                                    'Locality': script_json['address']['addressLocality'],
                                    'url': script_json['url'],
                                    'price': card.find('div', {'class': 'list-card-price'}).text,
                                    'bedrooms': card.find('ul',{'class': 'list-card-details'}).text[0],
                                    'bedroomsLab': card.find('ul',{'class': 'list-card-details'}).find('li', {'class': ''}).text,
                                    'baths': card.find('ul',{'class': 'list-card-details'}).text[5],
                                    'overview' : loopresults2
                                })
        except KeyError :
            pass
        time.sleep(5)

        Zillowdata =  pd.DataFrame(results)
        Zillowdata.to_csv(finalfile, index = False)


This is page :1
https://www.zillow.com/philadelphia-pa/rentals/1_p/
None
None
This is page :2
https://www.zillow.com/philadelphia-pa/rentals/2_p/


AttributeError: 'NoneType' object has no attribute 'contents'
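
Separately from the `None` problem in the traceback, the loop above drops a whole card whenever one JSON-LD key is missing, because of `except KeyError: pass`. A minimal sketch of a more forgiving extraction, assuming the same field names the loop reads, is shown here. The helper `extract_listing` and the sample payload are made up for illustration and are not real Zillow data.

```python
import json

def extract_listing(script_text):
    """Parse one JSON-LD string into a flat dict; missing keys become None."""
    data = json.loads(script_text)
    # "or {}" also covers the case where the key exists but holds null
    geo = data.get("geo") or {}
    address = data.get("address") or {}
    floor = data.get("floorSize") or {}
    return {
        "latitude": geo.get("latitude"),
        "longitude": geo.get("longitude"),
        "floorsize": floor.get("value"),
        "streetaddress": data.get("name"),
        "zipcode": address.get("postalCode"),
        "locality": address.get("addressLocality"),
        "url": data.get("url"),
    }

# Made-up payload with no floorSize entry, for illustration only
sample_text = json.dumps({
    "name": "123 Example St",
    "geo": {"latitude": 39.95, "longitude": -75.16},
    "address": {"postalCode": "19103", "addressLocality": "Philadelphia"},
    "url": "https://www.example.com/listing/1",
})
row = extract_listing(sample_text)
print(row["zipcode"], row["floorsize"])  # 19103 None
```

With this shape, a listing that lacks a floor size still produces a row, and the gap shows up as an empty cell in the CSV instead of a silently skipped record.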

## Testing to find the proper class to look for

In [9]:
CraiglistBrowser = webdriver.Chrome()
# CraiglistBrowser.maximize_window()

url="https://www.zillow.com/philadelphia-pa/rentals/" +str(1) + '_p/'

# Once the browser has launched, load the URL with get() and read back the rendered HTML
Craiglist = CraiglistBrowser.get(url)
CraiglistHTML = CraiglistBrowser.execute_script("return document.documentElement.outerHTML")
soup = BeautifulSoup(CraiglistHTML, 'html.parser')
# CraiglistBrowser.quit()
print(url)

# The old classes (photo-cards photo-cards_wow photo-cards_short) are no longer present
deck = soup.find('ul',{'class': 'StyledButton-c11n-8-84-3__sc-1xvdaej-0 eYPFID'})

print(deck.prettify())

for card in deck.contents: 
    script = card.find('script',  {'type': 'application/ld+json'})
    print(f"script:\n\n{script}")

https://www.zillow.com/philadelphia-pa/rentals/1_p/


AttributeError: 'NoneType' object has no attribute 'prettify'
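
Both lookups return `None` because the generated class names do not match what the page currently serves. Since the main loop reads listing data from `application/ld+json` script tags, one option is to search for those tags by their `type` attribute and ignore class names altogether. The sketch below uses only the standard library's `html.parser` so that it runs without a browser or extra packages, which is a swap from the notebook's BeautifulSoup approach. The class name `JsonLdCollector` and the HTML snippet are hypothetical.

```python
import json
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collect the parsed body of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False  # True while inside a JSON-LD script tag
        self._chunks = []     # text pieces of the current script body
        self.items = []       # one parsed object per JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._inside:
            self._inside = False
            try:
                self.items.append(json.loads("".join(self._chunks)))
            except json.JSONDecodeError:
                pass  # skip scripts whose body is not valid JSON

# Made-up HTML standing in for a rendered results page
html = """
<ul class="some-generated-class">
  <li><script type="application/ld+json">{"name": "123 Example St", "url": "https://www.example.com/1"}</script></li>
  <li><script type="application/ld+json">{"name": "456 Sample Ave", "url": "https://www.example.com/2"}</script></li>
  <li><script type="text/javascript">var x = 1;</script></li>
</ul>
"""
collector = JsonLdCollector()
collector.feed(html)
names = [item["name"] for item in collector.items]
print(names)  # ['123 Example St', '456 Sample Ave']
```

With BeautifulSoup, the comparable lookup would be `soup.find_all('script', {'type': 'application/ld+json'})` on the whole page, using the same attribute filter the main loop already applies per card. Whether the live page still embeds these tags is something to confirm in the browser's inspector.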

In [23]:
type(soup.find('ul',string=re.compile(".bds.")))

NoneType
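
The `.bds.` pattern points at another fragile spot: the main loop reads bedrooms and baths by character position (`text[0]` and `text[5]`), which breaks as soon as a count has two digits. A minimal regex-based sketch follows, assuming the details text looks like `3 bds2 ba1,200 sqft` (which is what those two indexes imply). The patterns and the helper name `parse_beds_baths` are assumptions to adjust against the live page.

```python
import re

# Assumed label format: "3 bds" or "1 bd" for bedrooms, "2 ba" or "1.5 ba" for baths.
# No word boundary is used because the labels run together, e.g. "3 bds2 ba".
BEDS_RE = re.compile(r"(\d+)\s*bds?")
BATHS_RE = re.compile(r"(\d+(?:\.\d+)?)\s*ba")

def parse_beds_baths(details_text):
    """Return (bedrooms, baths); either value is None when its label is absent."""
    beds = BEDS_RE.search(details_text)
    baths = BATHS_RE.search(details_text)
    return (
        int(beds.group(1)) if beds else None,
        float(baths.group(1)) if baths else None,
    )

print(parse_beds_baths("3 bds2 ba1,200 sqft"))  # (3, 2.0)
print(parse_beds_baths("10 bds1.5 ba"))         # (10, 1.5)
print(parse_beds_baths("Studio"))               # (None, None)
```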

In [17]:
BeautifulSoup.findAll?

Signature:
BeautifulSoup.findAll(
    self,
    name=None,
    attrs={},
    recursive=True,
    string=None,
    limit=None,
    **kwargs,
)
Docstring:
Look in the children of this PageElement and find all
PageElements that match the given criteria.

All find_* methods take a common set of arguments. See the online
documentation for detailed explanations.

:param name: A filter on tag name.
:param attrs: A dictionary of filters on attribute values.
:param recursive: If this is True, find_all() will perform a
    recursive search of this PageEl

In [7]:
soup.findAll('ul')

[<ul class="pfs__sc-1elvxkv-1 pfs__sc-1elvxkv-2 pfs__sc-98mhij-0 fItEqY eSokiI fNAqaU" data-display-my-zillow="true" data-zg-section="reg-login"><li class="pfs__sc-585qe5-0"><a class="Anchor-c11n-8-62-4__sc-hn4bge-0 pfs__sc-1dpbk03-0 bxnNjh bmhFIr" data-active="false" data-za-action="Sign in" data-za-category="!inherit" data-zg-role="section-title" href="/user/acct/login/?cid=pf"><div class="pfs__sc-1etb9mm-1 gQIEWT">Sign In</div></a></li><li class="pfs__sc-585qe5-0"><a class="Anchor-c11n-8-62-4__sc-hn4bge-0 pfs__sc-1dpbk03-0 bxnNjh bmhFIr" data-active="false" data-za-action="Join" data-za-category="!inherit" data-zg-role="section-title" href="/user/acct/register/?cid=pf"><span>Join<!-- --> </span></a></li></ul>,
 <ul class="pfs__sc-1elvxkv-1 pfs__sc-1elvxkv-2 pfs__sc-1wickoz-0 fItEqY eSokiI cEPJeu" data-zg-section="main"><li class="pfs__sc-585qe5-0"><a class="Anchor-c11n-8-62-4__sc-hn4bge-0 pfs__sc-1dpbk03-0 bxnNjh bmhFIr noroute" data-active="false" data-za-action="Buy" data-za-categ

In [7]:
time.sleep(0.0006)
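
The notebook pauses with a fixed `time.sleep(5)` between cards and imports the `random` module without using it. One possible way to combine the two is a pause of random length, sketched below. The helper name `polite_pause`, its default bounds, and the injectable `sleep` argument are all assumptions for illustration.

```python
import random
import time

def polite_pause(min_seconds=3.0, max_seconds=7.0, sleep=time.sleep):
    """Sleep for a random duration between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay

# Pass a no-op sleep to try the function without actually waiting
waited = polite_pause(3.0, 7.0, sleep=lambda seconds: None)
print(3.0 <= waited <= 7.0)  # True
```

In the scraping loop, `polite_pause()` would take the place of `time.sleep(5)`.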