# Craigslist Apartment Scraper

The purpose of this script is to pull apartment listings, characteristics, prices, and reply emails from for-rent ads on Craigslist. We plan to use this information to run an experiment testing whether including exclamation points in inquiries affects response rates.

In [1]:
#Import modules
import requests
from bs4 import BeautifulSoup as bs4
import pandas as pd
import re
import numpy as np

## Function to query craigslist  

This function lets us specify a price range, the number of bedrooms, and which Craigslist site to query (e.g. Denver, SF, NYC).  

Note that each query returns at most 100 results. We therefore want to keep the price ranges and bedroom counts narrow so that we capture as many listings as possible.

In [2]:
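#Define a function to fetch search results
def fetch_search_results(query=None, minAsk=None, maxAsk=None, bedrooms=None, base_URL = None):
    #base_URL selects the site and is not a search parameter, so keep it out of the query string
    params = {'query': query, 'minAsk': minAsk, 'maxAsk': maxAsk, 'bedrooms': bedrooms}
    search_params = {key: val for key, val in params.items() if val is not None}
    if not search_params:
        raise ValueError("No valid keywords")
    if base_URL is None:
        raise ValueError("No base URL")
    base = base_URL + '/search/apa'
    resp = requests.get(base, params=search_params, timeout=3)
    resp.raise_for_status()  # <- no-op unless the status is a 4xx/5xx error
    return resp.content, resp.encoding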

In [3]:
#test the query function.
test1, test2 = fetch_search_results(query = None, minAsk = 1000, maxAsk = 4000, bedrooms = 1, base_URL = 'https://denver.craigslist.org')
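
Because each query returns at most 100 results, one way to cover a wide price range is to split it into narrow, non-overlapping bands and run one query per band. The sketch below is only an illustration: `price_bands` and `build_search_url` are hypothetical helpers that are not part of this script, the band width of 500 is an arbitrary assumption, and the URLs are built with the standard library so that no request is actually sent.

```python
from urllib.parse import urlencode

def price_bands(min_ask, max_ask, width):
    """Split [min_ask, max_ask] into consecutive, non-overlapping (low, high) bands."""
    if width < 1:
        raise ValueError("width must be at least 1")
    bands = []
    low = min_ask
    while low <= max_ask:
        # each band ends one unit before the next one starts
        high = min(low + width - 1, max_ask)
        bands.append((low, high))
        low = high + 1
    return bands

def build_search_url(base_URL, **search_params):
    """Build the search URL for the given parameters (hypothetical helper)."""
    # drop parameters left as None so they do not appear in the query string
    params = {key: val for key, val in search_params.items() if val is not None}
    return base_URL + '/search/apa?' + urlencode(params)

# one (minAsk, maxAsk) pair per band, then one search URL per pair
bands = price_bands(1000, 2499, 500)
urls = [build_search_url('https://denver.craigslist.org', minAsk=low, maxAsk=high, bedrooms=1)
        for low, high in bands]
```

Each `(low, high)` pair could be passed as `minAsk` and `maxAsk` to `fetch_search_results`. A narrower width means more queries but less chance that any single query hits the result cap.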

## Function to get full URLs and apartment characteristics from query function output  

This function will go through each of the listings found from our query and compile a dataset of URLs and apartment characteristics of all the results from the query. We will use the URLs to get the reply email addresses in a later step.

### Helper functions to get apartment characteristics from each query result  

Price, bedrooms, square footage, listing title, posting date / time, and reply link.

In [4]:
#get href - the relative link to the full apartment listing. These relative links are identified by <a> tags
#and have the class 'result-title hdrlnk'.
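def get_href(result):
    link = result.find('a', {'class' : 'result-title hdrlnk'})
    
    #a missing <a> tag or href attribute yields a missing value instead of an error
    if link is None or link.get('href') is None:
        href = np.nan
    else:
        href = link['href']
    
    return href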

In [5]:
#get price - price can be located by <span> tags of class 'result-price'
def get_price(result):
    price = result.find('span', {'class' : 'result-price'})
    
    #convert price to float, dropping the dollar sign and any thousands separators
    if price is not None:
        price = float(price.text.strip().strip('$').replace(',', ''))
        
    else:
        price = np.nan
    
    return price

In [6]:
#get listing title which is identified by the text in the <a> tag with class 'result-title hdrlnk'
def get_title(result):
    link = result.find('a', {'class' : 'result-title hdrlnk'})
    
    #check for a missing <a> tag before reading its text
    if link is None:
        title = np.nan
    else:
        title = link.text
        
    return title


In [7]:
#get the time the listing was posted
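def get_posting_date(result):
    time_tag = result.find('time', {'class' : 'result-date'})
    
    #check for a missing <time> tag or datetime attribute before reading it
    if time_tag is None or time_tag.get('datetime') is None:
        posting_date = np.nan
    else:
        posting_date = time_tag['datetime']
        
    return posting_date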

In [54]:
#get bedrooms / sqft which is identified by the <span> tag of class 'housing'
def get_bedrooms_sqft_str(result):
    housing = result.find('span', {'class' : 'housing'})
    
    #listings without a housing span yield a missing value
    if housing is None:
        bedrooms_sqft = np.nan
    else:
        bedrooms_sqft = housing.text.strip('\n')
    
    return bedrooms_sqft

def get_bedrooms_sqft(bedrooms_sqft):
    #a missing housing string yields missing values for both characteristics
    if not isinstance(bedrooms_sqft, str):
        return np.nan, np.nan

    #*******
    #remove the dashes, new line characters and white space
    p_1 = re.compile(r'-|\s')

    bedrooms_sqft = p_1.sub('', bedrooms_sqft)

    #*******
    #get bedrooms: the digits directly followed by 'br'
    #use search rather than match so a string without bedrooms does not raise an error
    bedroom_m = re.search(r'\d+(?=br)', bedrooms_sqft, re.IGNORECASE)
    n_bedrooms = float(bedroom_m.group()) if bedroom_m else np.nan

    #*******
    #get square footage: the digits directly followed by 'ft'
    sqft_m = re.search(r'\d+(?=ft)', bedrooms_sqft, re.IGNORECASE)
    sqft = float(sqft_m.group()) if sqft_m else np.nan
    
    return n_bedrooms, sqft
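
To see how the two lookahead patterns behave, the following sketch applies `\d+(?=br)` and `\d+(?=ft)` to a few housing strings. The sample strings are assumptions about what the `housing` span may contain (they are not taken from a live query), and `parse_housing` is a hypothetical stand-alone variant that uses `re.search` and returns NaN for whichever value is missing.

```python
import math
import re

def parse_housing(housing):
    """Return (bedrooms, sqft) from a housing string, with NaN for missing parts."""
    # drop dashes and all whitespace so the digits sit directly before 'br' / 'ft'
    cleaned = re.sub(r'-|\s', '', housing)
    # the lookahead matches the digits without consuming the unit that follows
    bedroom_m = re.search(r'\d+(?=br)', cleaned, re.IGNORECASE)
    sqft_m = re.search(r'\d+(?=ft)', cleaned, re.IGNORECASE)
    n_bedrooms = float(bedroom_m.group()) if bedroom_m else math.nan
    sqft = float(sqft_m.group()) if sqft_m else math.nan
    return n_bedrooms, sqft

# assumed sample inputs: both values, bedrooms only, square footage only
samples = ['\n 1br -\n 736ft2 -\n', '3br -', '900ft2 -']
parsed = [parse_housing(s) for s in samples]
```

The first sample yields both values, while the other two yield NaN for the part that is absent instead of raising an error.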


### Function to compile all apartment characteristics

In [57]:
def compile_listing_URLs(query_result, base_URL):
    #parse the results of the query
    html = bs4(query_result, 'html.parser')

    #get all individual apartments from the query
    apt_results = html.find_all('p', attrs={'class' : 'result-info'})

    #initialize a list to contain one row of characteristics per listing
    rows = []
   
    #Loop through all of the tags containing the apartments and get the characteristics of those individual results.
    for apt in apt_results:
        #use helper functions to get characteristics
        href = get_href(apt)
        title = get_title(apt)
        bedrooms, sqft = get_bedrooms_sqft(get_bedrooms_sqft_str(apt))
        price = get_price(apt)
        posting_date = get_posting_date(apt)

        #construct the full URL and the reply URL for the listing.
        #note that the '/reply/den' prefix is specific to the Denver site.
        #remove only the '.html' suffix: str.strip('.html') strips characters, not a suffix
        if isinstance(href, str):
            full_URL = base_URL + href
            reply_URL = base_URL + '/reply/den' + re.sub(r'\.html$', '', href)
        else:
            full_URL = reply_URL = np.nan

        #add the characteristics of this listing to the rows
        rows.append([title, bedrooms, sqft, price, posting_date, full_URL, reply_URL])

    #build the result dataframe from the collected rows
    apts_results_df = pd.DataFrame(rows, columns = ['Listing_Title', 'Bedrooms', 'Sqft', 'Price', 'Posting_Date', 'full_URL', 'Reply_contact_info_link'])
    
    return apts_results_df

In [58]:
#test the compiler function
test_compiled_URLs = compile_listing_URLs(query_result = test1, base_URL = 'https://denver.craigslist.org')

test_compiled_URLs.head()

pd.set_option('display.max_colwidth',1000)

test_compiled_URLs.head()

| | Listing_Title | Bedrooms | Sqft | Price | Posting_Date | full_URL | Reply_contact_info_link |
|---|---|---|---|---|---|---|---|
| 0 | Lease with Option to Purchase! Great Centennial Home | 3.0 | 1431.0 | 2220.0 | 2017-02-20 05:28 | https://denver.craigslist.org/apa/6011368687.html | https://denver.craigslist.org/reply/den/apa/6011368687 |
| 1 | Pets Allowed, Pool, Cable/Satellite | 1.0 | 736.0 | 1165.0 | 2017-02-20 05:27 | https://denver.craigslist.org/apa/5999792572.html | https://denver.craigslist.org/reply/den/apa/5999792572 |
| 2 | Window Coverings, Patio, Package Receiving | 1.0 | 673.0 | 1270.0 | 2017-02-20 05:26 | https://denver.craigslist.org/apa/5969798331.html | https://denver.craigslist.org/reply/den/apa/5969798331 |
| 3 | Lease w/ Option to Purchase -- Centennial Home | 4.0 | 3293.0 | 2600.0 | 2017-02-20 05:17 | https://denver.craigslist.org/apa/6011361881.html | https://denver.craigslist.org/reply/den/apa/6011361881 |
| 4 | UNIVERSITY OF DENVER- ACROSS FROM DU LAW SCHOOL- MAR 1 | 2.0 | 900.0 | 1400.0 | 2017-02-20 05:11 | https://denver.craigslist.org/apa/6011350994.html | https://denver.craigslist.org/reply/den/apa/6011350994 |
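
The reply links in the output above follow a simple pattern: the site URL, a `/reply/den` prefix, and the listing's relative link without its `.html` suffix. A minimal sketch of that transformation follows. `build_reply_url` and its `site_code` argument are hypothetical names, and the pattern is inferred only from the Denver rows shown here, so it may not hold for other sites. The sketch removes the suffix explicitly because `str.strip('.html')` strips any of the characters `.`, `h`, `t`, `m`, `l` from both ends rather than the literal suffix.

```python
def build_reply_url(base_URL, href, site_code='den'):
    """Build a reply link from a relative listing link (hypothetical helper)."""
    suffix = '.html'
    # remove the literal suffix only when it is actually present
    if href.endswith(suffix):
        href = href[:-len(suffix)]
    return base_URL + '/reply/' + site_code + href

reply = build_reply_url('https://denver.craigslist.org', '/apa/6011368687.html')

# str.strip works on a set of characters, so it can remove more than the suffix
stripped = 'hotel.html'.strip('.html')
```

Here `reply` matches the first reply link in the output, while `stripped` ends up as `'ote'`, which shows why character stripping is risky for links that do not begin with `/` and end in digits.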


# Notes  

Craigslist started showing CAPTCHA dialog boxes before revealing the reply email addresses, so we may have to collect those addresses by hand.