In this script, I will be scraping data from realtor.com. I am going to get data on all single family homes for sale and all single family homes for rent in El Paso Texas or whatever other cities are included in the search.

This site gave me a basis on how I could web scrape data on Python: https://www.scrapingdog.com/blog/scrape-zillow/

In [1]:
import requests #This library allows me to retrieve data from a particular site
from bs4 import BeautifulSoup #BS allows me to get specific data from a parsed data tree on the site
import pandas as pd

In the code below, I create a list of urls because the site contains multiple pages of rental properties, so I want to navigate through all listings.

In [2]:
REALTOR_BASE_URL = "https://www.realtor.com/apartments/El-Paso_TX/type-single-family-home/pg-"

url_list = [f"{REALTOR_BASE_URL}{i}" for i in range(1, 15)]

In the next few cells, I import an empty excel file where I will be inputting all the gathered data and I declare some arrays to input the respective data.

In [3]:
EPRentals = pd.read_excel(r"C:\Users\marco\OneDrive\Documents\Application resources\Python and Pandas\El Paso Rentals.xlsx")

In [4]:
EPRentals

In [5]:
Rent = []
RentalAddress = []
RentalCity = []
RentalZipCode = []
RentalBedrooms = []
RentalBathrooms = []
RentalSquareFeet = []

In the next cell I have a for loop run through each url in the urllist. For every url I changed the 'requests' and 'BeautifulSoup' (BS) functions to match the url. Next, depending on the section and the parsed data on the site, I create a for loop using the BS funtion to find all the datapoints for the given url for rent, address, city, zipcode, bedrooms, bathrooms, and squarefeet, and append them all to their respective array.

For all 'BS' functions I'm returned with a string, since for some of the data I eventually want to convert the data type to either an integer or float, I get rid of particular text using the 'replace' function.

In [6]:
header = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"}
for url in url_list:
    
    resp = requests.get(url = url, headers = header)
    print(resp.status_code)
    soup = BeautifulSoup(resp.text,'lxml')
    
    for rent in soup.findAll('div',{'class': 'Pricestyles__StyledPrice-rui__btk3ge-0 kjbIiZ card-price'}):
        rent = rent.text
        rent = rent.replace("$", "")
        rent = rent.replace(",", "")
        rent = int(rent)

        Rent.append(rent)
        
        
    for bed in soup.findAll('li', {'data-testid' : 'property-meta-beds'}):
        bed = bed.text
        bed = bed.replace("bed", "")
        #bed = float(bed)
        
        RentalBedrooms.append(bed)
        
        
        
        
    for bath in soup.findAll('li', {'data-testid' : 'property-meta-baths'}):
        bath = bath.text
        bath = bath.replace("bath", "")
        #bath = float(bath)

        RentalBathrooms.append(bath)
        
        
        
        
    for sqft in soup.findAll('ul', {'class' : 'PropertyMetastyles__StyledPropertyMeta-rui__sc-1g5rdjn-0 iQEvdK card-meta'}):
        sqft = sqft.text
        sqft = sqft.replace("bath", "bath ")
        sqft = sqft.replace("sqft", "sqft ")  

        if "sqft" in sqft:
            sqft = sqft.split()
            for i in sqft:
                if "sqft" in i:
                    i = i.replace("sqft", "")
                    i = i.replace(",", "")
                    squarefeet = i
                    #squarefeet = int(i)
        else:
            squarefeet = "0"

        RentalSquareFeet.append(squarefeet)
        
        
        
        
    for address in soup.findAll('div', {'class' : 'content-col-left'}):
        address = address.text
        address = address.replace("El Paso, TX", " El Paso, TX")
        address = address.replace("Horizon", " Horizon")

        if "El Paso" in address:
            RentalCity.append("El Paso")
        elif "Horizon City" in address:
            RentalCity.append("Horizon City")
        else:
            RentalCity.append("Unkown")

        zipcode = address.split()
        zipcode = zipcode[len(zipcode)-1]
        RentalZipCode.append(zipcode)

        if "El Paso" in address:
            address = address.partition("El Paso")
        if "Horizon City" in address:
            address = address.partition("Horizon City")

        RentalAddress.append(address[0])
        


200
200
200
200
200
200
200
200
200
200
200
200
200
200


Once all the webscraping code is done, I add the data to the dataframe.

In [14]:
EPRentals['Address'] = RentalAddress
EPRentals['City'] = RentalCity
EPRentals['ZipCode'] = RentalZipCode
EPRentals['Bedrooms'] = RentalBedrooms
EPRentals['Bathrooms'] = RentalBathrooms
EPRentals['SquareFeet'] = RentalSquareFeet
EPRentals['Rent'] = Rent

I'm going to run through a few of the rows in the table to make sure that the data does match and the code was executed as intended.

In [15]:
EPRentals

Unnamed: 0,Address,City,ZipCode,Bedrooms,Bathrooms,SquareFeet,Rent
0,6838 St Lo Rd,El Paso,79925,2,1,944,1000
1,3744 Loma Jacinto Dr,El Paso,79938,3,2,1468,1500
2,529 Agua De Brisa,Horizon City,79928,3,2,1591,1200
3,2400 Tierra Mia Way,El Paso,79938,3,2,1519,1295
4,"Queens MHC6111 Sun Valley Dr Trlr 277,",El Paso,79924,3,2,1152,1200
...,...,...,...,...,...,...,...
581,11957 Mesquite Rock Dr,El Paso,79934,3,2,1380,1495
582,2517 San Jose Ave Unit A,El Paso,79930,3,1,820,1300
583,5406 Dalton Ave,El Paso,79924,3,1.5,1092,1175
584,14665 Boer Trail Ave,El Paso,79938,3,2,1340,1700


Now I am going to go through a similar process for the homes for sale as I did for the rentals.

Importing an empty spreadsheet
Declaring all arrays
Creating my list of urls

In [16]:
EPRealEstate = pd.read_excel(r"C:\Users\marco\OneDrive\Documents\Application resources\Python and Pandas\El Paso Real Estate.xlsx")

In [17]:
EPRealEstate

In [18]:
Price = []
Address = []
City = []
ZipCode = []
Bedrooms = []
Bathrooms = []
SquareFeet = []
Details = []

In [19]:
url_list = []
header = {"User-Agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"}

for i in range(1,79):
    
    url_list.append("https://www.realtor.com/realestateandhomes-search/El-Paso_TX/pg-" + str(i))
    

In the following cell, I have my forloop gather and append all the data from every url.

I attempted to gather more specific data from the site which would allow me to go through fewer steps in getting the end result, but some of the listings had varying data trees which resulted in some of the arrays being of different sized thatn others. Becuase I can't have arrays of different sizes, since data wouldn't match and I wouldn't be able to import it properly to the spreadsheet, I had to gather a bulk of data from a location that all listings have and find my own method of cleaning it. For each category you can see some of the steps I took to clean the data.

You may notice I still don't change the data type, I will be completing that after I imported all the data into the spreadsheet so I can get rid of complete faulty rows.

I have all the loops under one cell so that the script can run faster.

In [20]:
for url in url_list:
    
    resp = requests.get(url = url, headers = header)
    print(resp.status_code)
    soup = BeautifulSoup(resp.text,'lxml')
    
    for a in soup.findAll('div', {'class' : 'Pricestyles__StyledPrice-rui__btk3ge-0 bvgLFe card-price'}):
        price = a.text
    
        price = price.replace("$", "")
        price = price.replace("From", "")
        price= price.replace(",", "")
        price = int(price)
    
        Price.append(price)
        
    for b in soup.findAll('div', {'class' : 'content-col-left'}):
        address = b.text
    
        address = address.replace("El Paso, TX", " El Paso, TX")
        address = address.replace("Horizon", " Horizon")
    
        if ("El Paso, TX" in address) == True:
            address = address.partition('El Paso, TX')
        
        if ("Horizon City, TX" in address) == True:
            address = address.partition('Horizon City, TX')

        Address.append(address[0])
        City.append(address[1])
        ZipCode.append(address[2])
        
        
        
        
    for details in soup.findAll('ul', {'data-testid' : 'card-meta'}):
        details = details.text

        details = details.replace("bed", "bed ")
        details = details.replace("bath", "bath ")
        details = details.replace("sqft", "sqft ")
        details = details.split()
        
        if len(details) > 3:
            
            if "bed" in details[0]:

                beds = details[0]
                beds = beds.replace("bed", "")
                #beds = int(beds)
            else:
                Beds = 'None'
            Bedrooms.append(beds)



            if "bath" in details[1]:

                baths = details[1]
                baths = baths.replace("bath", "")
                #baths = float(baths)
            else:
                Baths = 'None'        
            Bathrooms.append(baths)



            if "sqft" in details[2]:

                sqft = details[2]
                sqft = sqft.replace("sqft", "")
                sqft = sqft.replace(",", "")
                #sqft = int(sqft)
            else:
                sqft = 'None'
            SquareFeet.append(sqft)
            
            
            
        else:
            Beds = 'None'
            sqft = 'None'
            Baths = 'None'  
            SquareFeet.append(sqft)
            Bathrooms.append(baths)
            Bedrooms.append(beds)
            


200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
403
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
403
200
200
403
403
403
403
403
403
403


I now do the same as before: 
Add the lists to the spreadsheet
Make sure the data matches

In [28]:
EPRealEstate['Address'] = Address
EPRealEstate['City'] = City
EPRealEstate['ZipCode'] = ZipCode
EPRealEstate['Bedrooms'] = Bedrooms
EPRealEstate['Bathrooms'] = Bathrooms
EPRealEstate['SquareFeet'] = SquareFeet
EPRealEstate['Price'] = Price

In [29]:
EPRealEstate

Unnamed: 0,Address,City,ZipCode,Bedrooms,Bathrooms,SquareFeet,Price
0,6605 Quail Cove Ct,"El Paso, TX",79912,4,3,2224,280000
1,231 Sofia Pl,"El Paso, TX",79907,4,1,1520,150000
2,408 S Festival Dr,"El Paso, TX",79912,3,2.5,2318,525000
3,11050 Breeze Ct,"El Paso, TX",79936,3,2,1101,129995
4,14470 Fernando Zubia Ave,"El Paso, TX",79938,3,2,1928,264999
...,...,...,...,...,...,...,...
2893,885 Harpendem Dr,"El Paso, TX",79928,4,3,1687,270950
2894,1800 Ritzy Lou Pl,"El Paso, TX",79928,3,2,1888,322950
2895,Montana Ave,"El Paso, TX",79938,3,2,,719000
2896,305 Playa Vista St,"El Paso, TX",79928,4,2.5,1954,346950


Now I change some of the data types that need to be floats or ints and transfer the reulting datafram to an excel file.

In [83]:
EPRentals['Bedrooms'] = pd.to_numeric(EPRentals['Bedrooms'], errors = 'coerce')
EPRentals['Bathrooms'] = pd.to_numeric(EPRentals['Bathrooms'], errors = 'coerce')
EPRentals['SquareFeet'] = pd.to_numeric(EPRentals['SquareFeet'], errors = 'coerce')
EPRentals['Rent'] = pd.to_numeric(EPRentals['Rent'], errors = 'coerce')
EPRentals['ZipCode'] = pd.to_numeric(EPRentals['ZipCode'], errors = 'coerce')
EPRentals.dtypes

Address        object
City           object
ZipCode         int64
Bedrooms      float64
Bathrooms     float64
SquareFeet      int64
Rent            int64
dtype: object

In [84]:
EPRealEstate['Bedrooms'] = pd.to_numeric(EPRealEstate['Bedrooms'], errors = 'coerce')
EPRealEstate['Bathrooms'] = pd.to_numeric(EPRealEstate['Bathrooms'], errors = 'coerce')
EPRealEstate['SquareFeet'] = pd.to_numeric(EPRealEstate['SquareFeet'], errors = 'coerce')
EPRealEstate['Price'] = pd.to_numeric(EPRealEstate['Price'], errors = 'coerce')
EPRealEstate['ZipCode'] = pd.to_numeric(EPRealEstate['ZipCode'], errors = 'coerce')
EPRealEstate.dtypes

Address        object
City           object
ZipCode         int64
Bedrooms        int64
Bathrooms     float64
SquareFeet    float64
Price           int64
dtype: object

In [85]:
EPRealEstate.to_excel("EPRealEstateData.xlsx", sheet_name = 'AllData')

In [86]:
EPRentals.to_excel("EPRentalData.xlsx", sheet_name = 'AllData')